Chapter 4

import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv("data/table.csv")
data.head()
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+

2. pivot_table

pd.pivot_table(data,index="ID",columns="Gender",values="Height").head()
# index: the column whose values become the rows of the result
# columns: the column whose values become the columns of the result
# values: the values aggregated in each cell
Gender      F      M
ID
1101      NaN  173.0
1102    192.0    NaN
1103      NaN  186.0
1104    167.0    NaN
1105    159.0    NaN
pandas provides many options for pivot_table; the commonly used parameters are introduced below:
① aggfunc: the aggregation applied within each group; all kinds of functions can be passed in, and the default is 'mean'
pd.pivot_table(data,index="School",columns="Gender",values="Height",aggfunc=["sum","max","mean"])
         sum          max           mean
Gender     F     M     F    M          F           M
School
S_1     1385  1251   192  195  173.125000  178.714286
S_2     1911  1548   194  193  173.727273  172.000000
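aggfunc can also be a dict mapping each value column to its own aggregation. A minimal sketch on a small inline frame (the values are made up, since data/table.csv is not reproduced here):

```python
import pandas as pd

# Hypothetical mini-frame standing in for data/table.csv
df = pd.DataFrame({
    "School": ["S_1", "S_1", "S_2", "S_2"],
    "Gender": ["F", "M", "F", "M"],
    "Height": [160, 170, 165, 175],
    "Weight": [50, 60, 55, 65],
})

# aggfunc as a dict: a different aggregation per value column
table = pd.pivot_table(df, index="School", columns="Gender",
                       values=["Height", "Weight"],
                       aggfunc={"Height": "mean", "Weight": "max"})
print(table)
```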
② margins: adds marginal totals
The outputs below show that the margin values depend on aggfunc: each margin cell is simply aggfunc applied over the corresponding row, column, or the whole table (with the default aggfunc="mean", the margins are means).
display(pd.pivot_table(data,index="School",columns="Gender",values="Height",aggfunc="mean",margins=True,margins_name="合计"))
display(pd.pivot_table(data,index="School",columns="Gender",values="Height",aggfunc="sum",margins=True,margins_name="合计"))
display(pd.pivot_table(data,index="School",columns="Gender",values="Height",aggfunc="max",margins=True,margins_name="合计"))
display(pd.pivot_table(data,index="School",columns="Gender",values="Height",aggfunc="median",margins=True,margins_name="合计"))
Gender           F           M        合计
School
S_1     173.125000  178.714286  175.733333
S_2     173.727273  172.000000  172.950000
合计    173.473684  174.937500  174.142857

Gender     F     M  合计
School
S_1     1385  1251  2636
S_2     1911  1548  3459
合计    3296  2799  6095

Gender    F    M  合计
School
S_1     192  195   195
S_2     194  193   194
合计    194  195   195

Gender    F    M   合计
School
S_1     171  186  175.0
S_2     164  171  170.5
合计    167  173  173.0
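The margin cells can be verified directly: each one is aggfunc applied over all the underlying values. A small sketch on inline toy data (hypothetical heights, since data/table.csv is not reproduced here):

```python
import pandas as pd

# Toy stand-in for data/table.csv (hypothetical heights)
df = pd.DataFrame({
    "School": ["S_1", "S_1", "S_2", "S_2"],
    "Gender": ["F", "M", "F", "M"],
    "Height": [160, 170, 165, 175],
})

for func in ["mean", "sum", "max"]:
    t = pd.pivot_table(df, index="School", columns="Gender", values="Height",
                       aggfunc=func, margins=True, margins_name="合计")
    # bottom-right margin cell == aggfunc over every Height value
    assert t.loc["合计", "合计"] == df["Height"].agg(func)
```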

③ Rows, columns, and values can all be multi-level

pd.pivot_table(data,index=["School","Class"],columns=["Gender","Address"],values=["Height","Weight"],margins=True,margins_name="合计").T
School                 S_1                  S_2                            合计
Class                  C_1    C_2    C_3    C_1    C_2    C_3    C_4
Gender Address
Height F street_1      NaN    NaN  175.0    NaN    NaN    NaN    NaN  175.000000
         street_2    179.5    NaN    NaN    NaN    NaN    NaN  176.0  177.750000
         street_4    159.0  176.0    NaN    NaN    NaN  157.0    NaN  164.000000
         street_5      NaN  162.0  187.0  159.0    NaN    NaN    NaN  169.333333
         street_6      NaN  167.0    NaN  161.0    NaN  164.0  175.5  168.600000
         street_7      NaN    NaN    NaN    NaN  188.5  190.0    NaN  189.000000
       M street_1    173.0    NaN    NaN    NaN  175.0    NaN    NaN  174.000000
         street_2    186.0    NaN  195.0    NaN    NaN    NaN    NaN  190.500000
         street_4      NaN    NaN  161.0  163.5  155.0  187.0    NaN  166.000000
         street_5      NaN  188.0    NaN    NaN  193.0  171.0    NaN  184.000000
         street_6      NaN  160.0    NaN    NaN    NaN    NaN    NaN  160.000000
         street_7      NaN    NaN  188.0  174.0    NaN    NaN  166.0  176.000000
合计                  175.4  170.6  181.2  164.2  180.0  173.8  173.8  174.142857
Weight F street_1      NaN    NaN   57.0    NaN    NaN    NaN    NaN   57.000000
         street_2     77.0    NaN    NaN    NaN    NaN    NaN   73.0   75.000000
         street_4     64.0   94.0    NaN    NaN    NaN   78.0    NaN   78.666667
         street_5      NaN   63.0   69.0   97.0    NaN    NaN    NaN   76.333333
         street_6      NaN   63.0    NaN   61.0    NaN   81.0   57.0   63.800000
         street_7      NaN    NaN    NaN    NaN   76.5   99.0    NaN   84.000000
       M street_1     63.0    NaN    NaN    NaN   74.0    NaN    NaN   68.500000
         street_2     82.0    NaN   70.0    NaN    NaN    NaN    NaN   76.000000
         street_4      NaN    NaN   68.0   71.0   91.0   73.0    NaN   74.800000
         street_5      NaN   68.0    NaN    NaN  100.0   88.0    NaN   85.333333
         street_6      NaN   53.0    NaN    NaN    NaN    NaN    NaN   53.000000
         street_7      NaN    NaN   82.0   84.0    NaN    NaN   82.0   82.666667
合计                   72.6   68.2   69.2   76.8   83.6   83.8   68.4   74.657143
help(pd.pivot_table)
Help on function pivot_table in module pandas.core.reshape.pivot:

pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False)
    Create a spreadsheet-style pivot table as a DataFrame. The levels in
    the pivot table will be stored in MultiIndex objects (hierarchical
    indexes) on the index and columns of the result DataFrame.
    
    Parameters
    ----------
    data : DataFrame
    values : column to aggregate, optional
    index : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table index.  If an array is passed,
        it is being used as the same manner as column values.
    columns : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table column.  If an array is passed,
        it is being used as the same manner as column values.
    aggfunc : function, list of functions, dict, default numpy.mean
        If list of functions passed, the resulting pivot table will have
        hierarchical columns whose top level are the function names
        (inferred from the function objects themselves)
        If dict is passed, the key is column to aggregate and value
        is function or list of functions
    fill_value : scalar, default None
        Value to replace missing values with
    margins : boolean, default False
        Add all row / columns (e.g. for subtotal / grand totals)
    dropna : boolean, default True
        Do not include columns whose entries are all NaN
    margins_name : string, default 'All'
        Name of the row / column that will contain the totals
        when margins is True.
    observed : boolean, default False
        This only applies if any of the groupers are Categoricals.
        If True: only show observed values for categorical groupers.
        If False: show all values for categorical groupers.
    
        .. versionchanged :: 0.25.0
    
    Returns
    -------
    DataFrame
    
    See Also
    --------
    DataFrame.pivot : Pivot without aggregation that can handle
        non-numeric data.
    
    Examples
    --------
    >>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
    ...                          "bar", "bar", "bar", "bar"],
    ...                    "B": ["one", "one", "one", "two", "two",
    ...                          "one", "one", "two", "two"],
    ...                    "C": ["small", "large", "large", "small",
    ...                          "small", "large", "small", "small",
    ...                          "large"],
    ...                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    ...                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
    >>> df
         A    B      C  D  E
    0  foo  one  small  1  2
    1  foo  one  large  2  4
    2  foo  one  large  2  5
    3  foo  two  small  3  5
    4  foo  two  small  3  6
    5  bar  one  large  4  6
    6  bar  one  small  5  8
    7  bar  two  small  6  9
    8  bar  two  large  7  9
    
    This first example aggregates values by taking the sum.
    
    >>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
    ...                     columns=['C'], aggfunc=np.sum)
    >>> table
    C        large  small
    A   B
    bar one    4.0    5.0
        two    7.0    6.0
    foo one    4.0    1.0
        two    NaN    6.0
    
    We can also fill missing values using the `fill_value` parameter.
    
    >>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
    ...                     columns=['C'], aggfunc=np.sum, fill_value=0)
    >>> table
    C        large  small
    A   B
    bar one      4      5
        two      7      6
    foo one      4      1
        two      0      6
    
    The next example aggregates by taking the mean across multiple columns.
    
    >>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
    ...                     aggfunc={'D': np.mean,
    ...                              'E': np.mean})
    >>> table
                    D         E
    A   C
    bar large  5.500000  7.500000
        small  5.500000  8.500000
    foo large  2.000000  4.500000
        small  2.333333  4.333333
    
    We can also calculate multiple types of aggregations for any given
    value column.
    
    >>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
    ...                     aggfunc={'D': np.mean,
    ...                              'E': [min, max, np.mean]})
    >>> table
                    D    E
                mean  max      mean  min
    A   C
    bar large  5.500000  9.0  7.500000  6.0
        small  5.500000  9.0  8.500000  8.0
    foo large  2.000000  5.0  4.500000  4.0
        small  2.333333  6.0  4.333333  2.0


3. crosstab (cross tabulation)

A cross table is a special kind of pivot table, typically used for grouped frequency counts. For example, to count the frequencies grouped by gender and school:

pd.crosstab(index=data["Gender"],columns=data["School"])
School  S_1  S_2
Gender
F         8   11
M         7    9
help(pd.crosstab)
Help on function crosstab in module pandas.core.reshape.pivot:

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
    Compute a simple cross tabulation of two (or more) factors. By default
    computes a frequency table of the factors unless an array of values and an
    aggregation function are passed.
    
    Parameters
    ----------
    index : array-like, Series, or list of arrays/Series
        Values to group by in the rows.
    columns : array-like, Series, or list of arrays/Series
        Values to group by in the columns.
    values : array-like, optional
        Array of values to aggregate according to the factors.
        Requires `aggfunc` be specified.
    rownames : sequence, default None
        If passed, must match number of row arrays passed.
    colnames : sequence, default None
        If passed, must match number of column arrays passed.
    aggfunc : function, optional
        If specified, requires `values` be specified as well.
    margins : bool, default False
        Add row/column margins (subtotals).
    margins_name : str, default 'All'
        Name of the row/column that will contain the totals
        when margins is True.
    
        .. versionadded:: 0.21.0
    
    dropna : bool, default True
        Do not include columns whose entries are all NaN.
    normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
        Normalize by dividing all values by the sum of values.
    
        - If passed 'all' or `True`, will normalize over all values.
        - If passed 'index' will normalize over each row.
        - If passed 'columns' will normalize over each column.
        - If margins is `True`, will also normalize margin values.
    
        .. versionadded:: 0.18.1
    
    Returns
    -------
    DataFrame
        Cross tabulation of the data.
    
    See Also
    --------
    DataFrame.pivot : Reshape data based on column values.
    pivot_table : Create a pivot table as a DataFrame.
    
    Notes
    -----
    Any Series passed will have their name attributes used unless row or column
    names for the cross-tabulation are specified.
    
    Any input passed containing Categorical data will have **all** of its
    categories included in the cross-tabulation, even if the actual data does
    not contain any instances of a particular category.
    
    In the event that there aren't overlapping indexes an empty DataFrame will
    be returned.
    
    Examples
    --------
    >>> a = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
    ...               "bar", "bar", "foo", "foo", "foo"], dtype=object)
    >>> b = np.array(["one", "one", "one", "two", "one", "one",
    ...               "one", "two", "two", "two", "one"], dtype=object)
    >>> c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
    ...               "shiny", "dull", "shiny", "shiny", "shiny"],
    ...              dtype=object)
    >>> pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
    b   one        two
    c   dull shiny dull shiny
    a
    bar    1     2    1     0
    foo    2     2    1     2
    
    Here 'c' and 'f' are not represented in the data and will not be
    shown in the output because dropna is True by default. Set
    dropna=False to preserve categories with no data.
    
    >>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
    >>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
    >>> pd.crosstab(foo, bar)
    col_0  d  e
    row_0
    a      1  0
    b      0  1
    >>> pd.crosstab(foo, bar, dropna=False)
    col_0  d  e  f
    row_0
    a      1  0  0
    b      0  1  0
    c      0  0  0
pd.crosstab(index=data["Gender"],columns=data["School"],normalize=True,margins=True,margins_name="合计")
School       S_1       S_2      合计
Gender
F       0.228571  0.314286  0.542857
M       0.200000  0.257143  0.457143
合计    0.428571  0.571429  1.000000
pd.crosstab(index=data["Gender"],columns=data["School"],values=np.random.randint(1,20,data.shape[0]),aggfunc="min",normalize=True,margins=True,margins_name="合计")

School       S_1       S_2  合计
Gender
F       0.153846  0.076923  0.25
M       0.230769  0.538462  0.75
合计    0.666667  0.333333  1.00
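Besides normalize=True (divide by the grand total), normalize also accepts 'index' and 'columns' to normalize per row or per column. A sketch on a toy frame (hypothetical data):

```python
import pandas as pd

# Toy stand-in for data/table.csv
df = pd.DataFrame({"Gender": ["F", "F", "M", "M", "M"],
                   "School": ["S_1", "S_2", "S_1", "S_1", "S_2"]})

# 'index': each row sums to 1; 'columns': each column sums to 1
by_row = pd.crosstab(df["Gender"], df["School"], normalize="index")
by_col = pd.crosstab(df["Gender"], df["School"], normalize="columns")
print(by_row)
print(by_col)
```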
np.random.randint(1,20,data.shape[0])
array([11, 18,  2,  4,  5,  8,  7, 14, 17, 16, 10,  7,  4,  2, 13,  3, 18,
       11, 14, 16,  6, 10, 17, 15,  2, 10, 13,  5, 14, 17,  5, 13,  7,  5,
        5])

II. Other reshaping methods

1. melt

The melt function can be thought of as the inverse of pivot: it compresses unstacked data into stacked form, turning a "wide" DataFrame into a "long" one.
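A minimal wide-to-long round trip on a toy frame (the ID/F/M names mirror the example below, but the values here are made up) shows the inverse relationship with pivot:

```python
import pandas as pd

wide = pd.DataFrame({"ID": [1101, 1102], "F": [32.5, 80.4], "M": [34.0, 87.2]})

# melt: wide -> long, stacking the F/M columns into one 'Gender' column
long = wide.melt(id_vars="ID", value_vars=["F", "M"],
                 var_name="Gender", value_name="Math")

# pivot: long -> wide again
back = long.pivot(index="ID", columns="Gender", values="Math").reset_index()
back.columns.name = None
print(back.equals(wide))  # → True
```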

data2 = data[["ID","Gender","Math"]]
pivoted = data2.pivot(index="ID",columns="Gender",values="Math")
pivoted
Gender      F     M
ID
1101      NaN  34.0
1102     32.5   NaN
1103      NaN  87.2
1104     80.4   NaN
1105     84.8   NaN
1201      NaN  97.0
1202     63.5   NaN
1203      NaN  58.8
1204     33.8   NaN
1205     68.4   NaN
1301      NaN  31.5
1302     87.7   NaN
1303      NaN  49.7
1304      NaN  85.2
1305     61.7   NaN
2101      NaN  83.3
2102     50.6   NaN
2103      NaN  52.5
2104     72.2   NaN
2105      NaN  34.2
2201      NaN  39.1
2202     68.5   NaN
2203      NaN  73.8
2204      NaN  47.2
2205     85.4   NaN
2301     72.3   NaN
2302      NaN  32.7
2303     65.9   NaN
2304     95.5   NaN
2305      NaN  48.9
2401     45.3   NaN
2402      NaN  48.7
2403     59.7   NaN
2404     67.7   NaN
2405     47.6   NaN

In melt, id_vars gives the columns to keep as identifiers, and value_vars gives the set of columns to stack

result = pivoted.reset_index().melt(id_vars="ID",value_vars=["F","M"],value_name="Math").dropna().set_index("ID").sort_index()
result
     Gender  Math
ID
1101      M  34.0
1102      F  32.5
1103      M  87.2
1104      F  80.4
1105      F  84.8
1201      M  97.0
1202      F  63.5
1203      M  58.8
1204      F  33.8
1205      F  68.4
1301      M  31.5
1302      F  87.7
1303      M  49.7
1304      M  85.2
1305      F  61.7
2101      M  83.3
2102      F  50.6
2103      M  52.5
2104      F  72.2
2105      M  34.2
2201      M  39.1
2202      F  68.5
2203      M  73.8
2204      M  47.2
2205      F  85.4
2301      F  72.3
2302      M  32.7
2303      F  65.9
2304      F  95.5
2305      M  48.9
2401      F  45.3
2402      M  48.7
2403      F  59.7
2404      F  67.7
2405      F  47.6
result.equals(data2.set_index("ID"))
True

2. Stacking and unstacking

(1) stack: the most basic reshaping function, with only two parameters: level and dropna
data_s = pd.pivot_table(data,index=["Class","ID"],columns="Gender",values=["Height","Weight"])
data_s.head()
            Height        Weight
Gender           F      M      F     M
Class ID
C_1   1101     NaN  173.0    NaN  63.0
      1102   192.0    NaN   73.0   NaN
      1103     NaN  186.0    NaN  82.0
      1104   167.0    NaN   81.0   NaN
      1105   159.0    NaN   64.0   NaN
data_s.groupby("Class").head(2)
            Height        Weight
Gender           F      M      F     M
Class ID
C_1   1101     NaN  173.0    NaN  63.0
      1102   192.0    NaN   73.0   NaN
C_2   1201     NaN  188.0    NaN  68.0
      1202   176.0    NaN   94.0   NaN
C_3   1301     NaN  161.0    NaN  68.0
      1302   175.0    NaN   57.0   NaN
C_4   2401   192.0    NaN   62.0   NaN
      2402     NaN  166.0    NaN  82.0
data_stacked = data_s.stack()
data_stacked.groupby("Class").head(2)
                   Height  Weight
Class ID   Gender
C_1   1101 M        173.0    63.0
      1102 F        192.0    73.0
C_2   1201 M        188.0    68.0
      1202 F        176.0    94.0
C_3   1301 M        161.0    68.0
      1302 F        175.0    57.0
C_4   2401 F        192.0    62.0
      2402 M        166.0    82.0

stack can be seen as moving a horizontal (column) index down into the vertical (row) index, so its effect is similar to melt; the level parameter specifies which column level to move (or which levels, passed as a list)
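A sketch of the level parameter on a toy frame with two column levels (hypothetical values):

```python
import pandas as pd

cols = pd.MultiIndex.from_product([["Height", "Weight"], ["F", "M"]],
                                  names=[None, "Gender"])
df = pd.DataFrame([[160.0, 170.0, 50.0, 60.0],
                   [165.0, 175.0, 55.0, 65.0]],
                  index=["a", "b"], columns=cols)

inner = df.stack()         # default: the innermost column level (Gender) moves down
outer = df.stack(level=0)  # level=0: the outer Height/Weight level moves instead
print(inner.columns.tolist())  # ['Height', 'Weight']
print(outer.columns.tolist())  # ['F', 'M']
```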

data_s
            Height        Weight
Gender           F      M      F      M
Class ID
C_1   1101     NaN  173.0    NaN   63.0
      1102   192.0    NaN   73.0    NaN
      1103     NaN  186.0    NaN   82.0
      1104   167.0    NaN   81.0    NaN
      1105   159.0    NaN   64.0    NaN
      2101     NaN  174.0    NaN   84.0
      2102   161.0    NaN   61.0    NaN
      2103     NaN  157.0    NaN   61.0
      2104   159.0    NaN   97.0    NaN
      2105     NaN  170.0    NaN   81.0
C_2   1201     NaN  188.0    NaN   68.0
      1202   176.0    NaN   94.0    NaN
      1203     NaN  160.0    NaN   53.0
      1204   162.0    NaN   63.0    NaN
      1205   167.0    NaN   63.0    NaN
      2201     NaN  193.0    NaN  100.0
      2202   194.0    NaN   77.0    NaN
      2203     NaN  155.0    NaN   91.0
      2204     NaN  175.0    NaN   74.0
      2205   183.0    NaN   76.0    NaN
C_3   1301     NaN  161.0    NaN   68.0
      1302   175.0    NaN   57.0    NaN
      1303     NaN  188.0    NaN   82.0
      1304     NaN  195.0    NaN   70.0
      1305   187.0    NaN   69.0    NaN
      2301   157.0    NaN   78.0    NaN
      2302     NaN  171.0    NaN   88.0
      2303   190.0    NaN   99.0    NaN
      2304   164.0    NaN   81.0    NaN
      2305     NaN  187.0    NaN   73.0
C_4   2401   192.0    NaN   62.0    NaN
      2402     NaN  166.0    NaN   82.0
      2403   158.0    NaN   60.0    NaN
      2404   160.0    NaN   84.0    NaN
      2405   193.0    NaN   54.0    NaN
(2) unstack: the inverse of stack, functionally similar to pivot_table
result2 = data_stacked.unstack().sort_index(axis=1)  # unstack moves the innermost row level (Gender) back to the columns
result2.equals(data_s)
True
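The round trip works here, but it is not guaranteed in general: stack(dropna=True) silently drops all-NaN entries, which unstack cannot recreate (compare Question 4 below). A toy sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

# Row 1103 is entirely NaN: stack() drops those entries (dropna=True by default),
# so unstack() cannot bring the row back
df = pd.DataFrame({"F": [np.nan, 32.5, np.nan], "M": [34.0, np.nan, np.nan]},
                  index=pd.Index([1101, 1102, 1103], name="ID"))
round_trip = df.stack().unstack()
print(df.shape, round_trip.shape)  # (3, 2) (2, 2) -- row 1103 is lost
```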

III. Dummy variables and factorization

1. Dummy variables

This section mainly introduces the get_dummies function, whose main purpose is one-hot encoding:
data.head()
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+
data_d = data[["Class","Gender","Weight"]]
data_d.head()
  Class Gender  Weight
0   C_1      M      63
1   C_1      F      73
2   C_1      M      82
3   C_1      F      81
4   C_1      F      64
pd.get_dummies(data_d[["Class","Gender"]].join(data_d["Weight"])).head()
   Weight  Class_C_1  Class_C_2  Class_C_3  Class_C_4  Gender_F  Gender_M
0      63          1          0          0          0         0         1
1      73          1          0          0          0         1         0
2      82          1          0          0          0         0         1
3      81          1          0          0          0         1         0
4      64          1          0          0          0         1         0
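Two options worth knowing, sketched on a toy Series (hypothetical values): prefix controls the column-name prefix, and drop_first=True keeps k-1 of the k indicator columns, which avoids perfect collinearity in regression models:

```python
import pandas as pd

s = pd.Series(["M", "F", "M"], name="Gender")

full = pd.get_dummies(s, prefix="Gender")
reduced = pd.get_dummies(s, prefix="Gender", drop_first=True)
print(full.columns.tolist())     # ['Gender_F', 'Gender_M']
print(reduced.columns.tolist())  # ['Gender_M']
```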

2. The factorize method

This method is mainly used for integer encoding of categorical values; missing values are coded as -1. The sort parameter controls whether codes are assigned after sorting the unique values
codes,uniques = pd.factorize(["B",None,"A","C","B","D","F","A"])
codes
array([ 0, -1,  1,  2,  0,  3,  4,  1], dtype=int64)
uniques
array(['B', 'A', 'C', 'D', 'F'], dtype=object)
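A sketch of the sort parameter mentioned above: with sort=True the codes follow the sorted order of the unique values rather than first-appearance order, and missing values are still coded -1:

```python
import pandas as pd

codes, uniques = pd.factorize(["B", None, "A", "C", "B"], sort=True)
print(codes)    # [ 1 -1  0  2  1]
print(uniques)  # ['A' 'B' 'C']
```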

IV. Questions and exercises

1. Questions

[Question 1] Many reshaping functions were introduced above, such as melt/crosstab/pivot/pivot_table/stack/unstack; summarize the characteristic use cases of each.
[Question 2] What is the relationship between reshaping functions and multi-level indexes? Which reshaping functions change the number of index levels, and how exactly?
[Question 3] Give an example of a dummy-variable technique other than the one covered above.
[Question 4] Is calling unstack immediately after stack guaranteed to reproduce the original table exactly?
[Question 5] Three functions for pivot tables were covered; use each of them to accomplish the same (self-chosen) task and compare which is fastest.
[Question 6] Given that melt provides the functionality of stack, why design a stack function at all?

2. Exercises

[Exercise 1] Continue working with the drug dataset from the previous chapter:
drug = pd.read_csv("data/Drugs.csv")
drug
       YYYY State           COUNTY  SubstanceName  DrugReports
0      2010    VA         ACCOMACK   Propoxyphene            1
1      2010    OH            ADAMS       Morphine            9
2      2010    PA            ADAMS      Methadone            2
3      2010    VA  ALEXANDRIA CITY         Heroin            5
4      2010    PA        ALLEGHENY  Hydromorphone            5
...     ...   ...              ...            ...          ...
24057  2017    VA            WYTHE        Codeine            1
24058  2017    VA            WYTHE    Hydrocodone           19
24059  2017    VA            WYTHE       Tramadol            5
24060  2017    PA             YORK           ANPP            1
24061  2017    VA             YORK         Heroin           48

24062 rows × 5 columns

(a) Transform the table into the following shape: each row should show how each drug changed in each region from 2010 to 2017, with the first three columns sorted:

(figure: picture/drug_pic.png)

(b) Restore the result of (a) back to the original table, and use the equals function to check that the initial table matches the new result (returns True)
drug = drug.sort_values(["State","COUNTY","SubstanceName"]).reset_index()

drug

result = pd.pivot_table(drug,index=["State","COUNTY","SubstanceName"],columns="YYYY",values="DrugReports",aggfunc="sum",fill_value="-").reset_index()
result
YYYY State   COUNTY  SubstanceName 2010 2011 2012 2013 2014 2015 2016 2017
0       KY    ADAIR  Buprenorphine    -    3    5    4   27    5    7   10
1       KY    ADAIR        Codeine    -    -    1    -    -    -    -    1
2       KY    ADAIR       Fentanyl    -    -    1    -    -    -    -    -
3       KY    ADAIR         Heroin    -    -    1    2    -    1    -    2
4       KY    ADAIR    Hydrocodone    6    9   10   10    9    7   11    3
...    ...      ...            ...  ...  ...  ...  ...  ...  ...  ...  ...
6209    WV     WOOD      Oxycodone    6    4   24    7    7   11    7    1
6210    WV     WOOD       Tramadol    -    -    -    -    1    -    4    3
6211    WV  WYOMING  Buprenorphine    -    1    1    1    -    -    -    1
6212    WV  WYOMING    Hydrocodone    1    5    -    -    1    -    1    -
6213    WV  WYOMING      Oxycodone    5    4   14   12    5    -    -    -

6214 rows × 11 columns

result.melt(id_vars=["State","COUNTY","SubstanceName"],value_vars=[2010,2011,2012,2013,2014,2015,2016,2017],value_name="DrugReports")
      State   COUNTY  SubstanceName  YYYY DrugReports
0        KY    ADAIR  Buprenorphine  2010           -
1        KY    ADAIR        Codeine  2010           -
2        KY    ADAIR       Fentanyl  2010           -
3        KY    ADAIR         Heroin  2010           -
4        KY    ADAIR    Hydrocodone  2010           6
...     ...      ...            ...   ...         ...
49707    WV     WOOD      Oxycodone  2017           1
49708    WV     WOOD       Tramadol  2017           3
49709    WV  WYOMING  Buprenorphine  2017           1
49710    WV  WYOMING    Hydrocodone  2017           -
49711    WV  WYOMING      Oxycodone  2017           -

49712 rows × 5 columns

drug.head()
   index  YYYY State COUNTY  SubstanceName  DrugReports
0   2731  2011    KY  ADAIR  Buprenorphine            3
1   5319  2012    KY  ADAIR  Buprenorphine            5
2   8782  2013    KY  ADAIR  Buprenorphine            4
3  12163  2014    KY  ADAIR  Buprenorphine           27
4  13645  2015    KY  ADAIR  Buprenorphine            5
result = pd.DataFrame()
for name, group in drug.groupby(["State","COUNTY"]):
    # build one wide block per (State, COUNTY) group, then concatenate them
    group1 = pd.crosstab(index=group["SubstanceName"],columns=group["YYYY"],values=group["DrugReports"],aggfunc="sum").fillna("-").sort_index()
    group1["State"] = name[0]
    group1["COUNTY"] = name[1]
    result = pd.concat([group1,result])
result
                2010 2011 2012 2013 2014 2016 2017 State   COUNTY 2015
SubstanceName
Buprenorphine      -    1    1    1    -    -    1    WV  WYOMING  NaN
Hydrocodone        1    5    -    -    1    1    -    WV  WYOMING  NaN
Oxycodone          5    4   14   12    5    -    -    WV  WYOMING  NaN
Acetyl fentanyl    -    -    -    -    -    -    1    WV     WOOD   20
Acryl fentanyl     -    -    -    -    -    -    9    WV     WOOD    -
...              ...  ...  ...  ...  ...  ...  ...   ...      ...  ...
Hydromorphone      -    -    1    -    -    1    -    KY    ADAIR    -
Methadone          1    -    1    -    -    -    -    KY    ADAIR    -
Morphine           -    2    4    -    -    1    -    KY    ADAIR    1
Oxycodone          -    4    1    1    9    -    1    KY    ADAIR    2
Tramadol           -    1    -    -    -    -    -    KY    ADAIR    -

6214 rows × 10 columns

(b) Restore the result of (a) back to the original table, and use the equals function to check that the initial table matches the new result (returns True)
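A sketch of the idea behind (b) on a tiny stand-in table (the real Drugs.csv is not loaded here, so the counts are made up): melt the wide table back, drop the '-' placeholders that fill_value introduced, restore the dtypes, and only then can equals return True:

```python
import pandas as pd

# Tiny stand-in for the drug table (hypothetical counts)
drug = pd.DataFrame({"YYYY": [2010, 2011, 2010],
                     "State": ["KY", "KY", "VA"],
                     "COUNTY": ["ADAIR", "ADAIR", "WYTHE"],
                     "SubstanceName": ["Heroin", "Heroin", "Codeine"],
                     "DrugReports": [2, 3, 1]})

# (a)-style wide table
wide = pd.pivot_table(drug, index=["State", "COUNTY", "SubstanceName"],
                      columns="YYYY", values="DrugReports",
                      aggfunc="sum", fill_value="-").reset_index()

# (b): melt back, drop fillers, fix dtypes, realign columns
long = (wide.melt(id_vars=["State", "COUNTY", "SubstanceName"],
                  var_name="YYYY", value_name="DrugReports")
            .query("DrugReports != '-'"))
long["YYYY"] = long["YYYY"].astype("int64")
long["DrugReports"] = long["DrugReports"].astype("int64")
restored = (long.reindex(columns=drug.columns)
                .sort_values(["YYYY", "State"])
                .reset_index(drop=True))
print(restored.equals(drug.sort_values(["YYYY", "State"]).reset_index(drop=True)))
```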

[Exercise 2] Given a dataset on earthquakes in a certain region, solve the following:
(a) Transform the table into the following shape: spread the direction column out, and compress the distance, depth, and intensity attributes:

(figure: picture/earthquake_pic.png)

(b) Restore the result of (a) back to the original table, and use the equals function to check that the initial table matches the new result (returns True)
