Chapter 4

import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv("data/table.csv")
data.head()
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+

2. pivot_table

pd.pivot_table(data,index="ID",columns="Gender",values="Height").head()
# index: the column whose values become the rows of the result
# columns: the column whose values become the columns of the result
# values: the values aggregated in each cell
Gender      F      M
ID
1101      NaN  173.0
1102    192.0    NaN
1103      NaN  186.0
1104    167.0    NaN
1105    159.0    NaN
pandas provides many options for pivot_table; the commonly used parameters are introduced below:
① aggfunc: the aggregation applied within each group; all kinds of functions can be passed in, and the default is 'mean'
pd.pivot_table(data,index="School",columns="Gender",values="Height",aggfunc=["sum","max","mean"])
         sum          max           mean
Gender     F     M     F    M          F           M
School
S_1     1385  1251   192  195  173.125000  178.714286
S_2     1911  1548   194  193  173.727273  172.000000
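aggfunc can also be a dict mapping each value column to its own aggregation. A minimal sketch on a small inline frame (the values are made up, since data/table.csv is not reproduced here):

```python
import pandas as pd

# Hypothetical mini-frame standing in for data/table.csv
df = pd.DataFrame({
    "School": ["S_1", "S_1", "S_2", "S_2"],
    "Gender": ["F", "M", "F", "M"],
    "Height": [160, 170, 165, 175],
    "Weight": [50, 60, 55, 65],
})

# aggfunc as a dict: a different aggregation per value column
table = pd.pivot_table(df, index="School", columns="Gender",
                       values=["Height", "Weight"],
                       aggfunc={"Height": "mean", "Weight": "max"})
print(table)
```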
② margins: adds marginal totals
The outputs below show that the margin values depend on aggfunc: each margin cell is simply aggfunc applied over the corresponding row, column, or the whole table (with the default aggfunc="mean", the margins are means).
display(pd.pivot_table(data,index="School",columns="Gender",values="Height",aggfunc="mean",margins=True,margins_name="合计"))
display(pd.pivot_table(data,index="School",columns="Gender",values="Height",aggfunc="sum",margins=True,margins_name="合计"))
display(pd.pivot_table(data,index="School",columns="Gender",values="Height",aggfunc="max",margins=True,margins_name="合计"))
display(pd.pivot_table(data,index="School",columns="Gender",values="Height",aggfunc="median",margins=True,margins_name="合计"))
Gender           F           M        合计
School
S_1     173.125000  178.714286  175.733333
S_2     173.727273  172.000000  172.950000
合计    173.473684  174.937500  174.142857

Gender     F     M  合计
School
S_1     1385  1251  2636
S_2     1911  1548  3459
合计    3296  2799  6095

Gender    F    M  合计
School
S_1     192  195   195
S_2     194  193   194
合计    194  195   195

Gender    F    M   合计
School
S_1     171  186  175.0
S_2     164  171  170.5
合计    167  173  173.0
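The margin cells can be verified directly: each one is aggfunc applied over all the underlying values. A small sketch on inline toy data (hypothetical heights, since data/table.csv is not reproduced here):

```python
import pandas as pd

# Toy stand-in for data/table.csv (hypothetical heights)
df = pd.DataFrame({
    "School": ["S_1", "S_1", "S_2", "S_2"],
    "Gender": ["F", "M", "F", "M"],
    "Height": [160, 170, 165, 175],
})

for func in ["mean", "sum", "max"]:
    t = pd.pivot_table(df, index="School", columns="Gender", values="Height",
                       aggfunc=func, margins=True, margins_name="合计")
    # bottom-right margin cell == aggfunc over every Height value
    assert t.loc["合计", "合计"] == df["Height"].agg(func)
```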

③ Rows, columns, and values can all be multi-level

pd.pivot_table(data,index=["School","Class"],columns=["Gender","Address"],values=["Height","Weight"],margins=True,margins_name="合计").T
School                 S_1                  S_2                            合计
Class                  C_1    C_2    C_3    C_1    C_2    C_3    C_4
Gender Address
Height F street_1      NaN    NaN  175.0    NaN    NaN    NaN    NaN  175.000000
         street_2    179.5    NaN    NaN    NaN    NaN    NaN  176.0  177.750000
         street_4    159.0  176.0    NaN    NaN    NaN  157.0    NaN  164.000000
         street_5      NaN  162.0  187.0  159.0    NaN    NaN    NaN  169.333333
         street_6      NaN  167.0    NaN  161.0    NaN  164.0  175.5  168.600000
         street_7      NaN    NaN    NaN    NaN  188.5  190.0    NaN  189.000000
       M street_1    173.0    NaN    NaN    NaN  175.0    NaN    NaN  174.000000
         street_2    186.0    NaN  195.0    NaN    NaN    NaN    NaN  190.500000
         street_4      NaN    NaN  161.0  163.5  155.0  187.0    NaN  166.000000
         street_5      NaN  188.0    NaN    NaN  193.0  171.0    NaN  184.000000
         street_6      NaN  160.0    NaN    NaN    NaN    NaN    NaN  160.000000
         street_7      NaN    NaN  188.0  174.0    NaN    NaN  166.0  176.000000
合计                  175.4  170.6  181.2  164.2  180.0  173.8  173.8  174.142857
Weight F street_1      NaN    NaN   57.0    NaN    NaN    NaN    NaN   57.000000
         street_2     77.0    NaN    NaN    NaN    NaN    NaN   73.0   75.000000
         street_4     64.0   94.0    NaN    NaN    NaN   78.0    NaN   78.666667
         street_5      NaN   63.0   69.0   97.0    NaN    NaN    NaN   76.333333
         street_6      NaN   63.0    NaN   61.0    NaN   81.0   57.0   63.800000
         street_7      NaN    NaN    NaN    NaN   76.5   99.0    NaN   84.000000
       M street_1     63.0    NaN    NaN    NaN   74.0    NaN    NaN   68.500000
         street_2     82.0    NaN   70.0    NaN    NaN    NaN    NaN   76.000000
         street_4      NaN    NaN   68.0   71.0   91.0   73.0    NaN   74.800000
         street_5      NaN   68.0    NaN    NaN  100.0   88.0    NaN   85.333333
         street_6      NaN   53.0    NaN    NaN    NaN    NaN    NaN   53.000000
         street_7      NaN    NaN   82.0   84.0    NaN    NaN   82.0   82.666667
合计                   72.6   68.2   69.2   76.8   83.6   83.8   68.4   74.657143
help(pd.pivot_table)
Help on function pivot_table in module pandas.core.reshape.pivot:

pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False)
    Create a spreadsheet-style pivot table as a DataFrame. The levels in
    the pivot table will be stored in MultiIndex objects (hierarchical
    indexes) on the index and columns of the result DataFrame.
    
    Parameters
    ----------
    data : DataFrame
    values : column to aggregate, optional
    index : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table index.  If an array is passed,
        it is being used as the same manner as column values.
    columns : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table column.  If an array is passed,
        it is being used as the same manner as column values.
    aggfunc : function, list of functions, dict, default numpy.mean
        If list of functions passed, the resulting pivot table will have
        hierarchical columns whose top level are the function names
        (inferred from the function objects themselves)
        If dict is passed, the key is column to aggregate and value
        is function or list of functions
    fill_value : scalar, default None
        Value to replace missing values with
    margins : boolean, default False
        Add all row / columns (e.g. for subtotal / grand totals)
    dropna : boolean, default True
        Do not include columns whose entries are all NaN
    margins_name : string, default 'All'
        Name of the row / column that will contain the totals
        when margins is True.
    observed : boolean, default False
        This only applies if any of the groupers are Categoricals.
        If True: only show observed values for categorical groupers.
        If False: show all values for categorical groupers.
    
        .. versionchanged :: 0.25.0
    
    Returns
    -------
    DataFrame
    
    See Also
    --------
    DataFrame.pivot : Pivot without aggregation that can handle
        non-numeric data.
    
    Examples
    --------
    >>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
    ...                          "bar", "bar", "bar", "bar"],
    ...                    "B": ["one", "one", "one", "two", "two",
    ...                          "one", "one", "two", "two"],
    ...                    "C": ["small", "large", "large", "small",
    ...                          "small", "large", "small", "small",
    ...                          "large"],
    ...                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    ...                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
    >>> df
         A    B      C  D  E
    0  foo  one  small  1  2
    1  foo  one  large  2  4
    2  foo  one  large  2  5
    3  foo  two  small  3  5
    4  foo  two  small  3  6
    5  bar  one  large  4  6
    6  bar  one  small  5  8
    7  bar  two  small  6  9
    8  bar  two  large  7  9
    
    This first example aggregates values by taking the sum.
    
    >>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
    ...                     columns=['C'], aggfunc=np.sum)
    >>> table
    C        large  small
    A   B
    bar one    4.0    5.0
        two    7.0    6.0
    foo one    4.0    1.0
        two    NaN    6.0
    
    We can also fill missing values using the `fill_value` parameter.
    
    >>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
    ...                     columns=['C'], aggfunc=np.sum, fill_value=0)
    >>> table
    C        large  small
    A   B
    bar one      4      5
        two      7      6
    foo one      4      1
        two      0      6
    
    The next example aggregates by taking the mean across multiple columns.
    
    >>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
    ...                     aggfunc={'D': np.mean,
    ...                              'E': np.mean})
    >>> table
                    D         E
    A   C
    bar large  5.500000  7.500000
        small  5.500000  8.500000
    foo large  2.000000  4.500000
        small  2.333333  4.333333
    
    We can also calculate multiple types of aggregations for any given
    value column.
    
    >>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
    ...                     aggfunc={'D': np.mean,
    ...                              'E': [min, max, np.mean]})
    >>> table
                    D    E
                mean  max      mean  min
    A   C
    bar large  5.500000  9.0  7.500000  6.0
        small  5.500000  9.0  8.500000  8.0
    foo large  2.000000  5.0  4.500000  4.0
        small  2.333333  6.0  4.333333  2.0


3. crosstab (cross tabulation)

A cross table is a special kind of pivot table, typically used for grouped frequency counts. For example, to count the frequencies grouped by gender and school:

pd.crosstab(index=data["Gender"],columns=data["School"])
School  S_1  S_2
Gender
F         8   11
M         7    9
help(pd.crosstab)
Help on function crosstab in module pandas.core.reshape.pivot:

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
    Compute a simple cross tabulation of two (or more) factors. By default
    computes a frequency table of the factors unless an array of values and an
    aggregation function are passed.
    
    Parameters
    ----------
    index : array-like, Series, or list of arrays/Series
        Values to group by in the rows.
    columns : array-like, Series, or list of arrays/Series
        Values to group by in the columns.
    values : array-like, optional
        Array of values to aggregate according to the factors.
        Requires `aggfunc` be specified.
    rownames : sequence, default None
        If passed, must match number of row arrays passed.
    colnames : sequence, default None
        If passed, must match number of column arrays passed.
    aggfunc : function, optional
        If specified, requires `values` be specified as well.
    margins : bool, default False
        Add row/column margins (subtotals).
    margins_name : str, default 'All'
        Name of the row/column that will contain the totals
        when margins is True.
    
        .. versionadded:: 0.21.0
    
    dropna : bool, default True
        Do not include columns whose entries are all NaN.
    normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
        Normalize by dividing all values by the sum of values.
    
        - If passed 'all' or `True`, will normalize over all values.
        - If passed 'index' will normalize over each row.
        - If passed 'columns' will normalize over each column.
        - If margins is `True`, will also normalize margin values.
    
        .. versionadded:: 0.18.1
    
    Returns
    -------
    DataFrame
        Cross tabulation of the data.
    
    See Also
    --------
    DataFrame.pivot : Reshape data based on column values.
    pivot_table : Create a pivot table as a DataFrame.
    
    Notes
    -----
    Any Series passed will have their name attributes used unless row or column
    names for the cross-tabulation are specified.
    
    Any input passed containing Categorical data will have **all** of its
    categories included in the cross-tabulation, even if the actual data does
    not contain any instances of a particular category.
    
    In the event that there aren't overlapping indexes an empty DataFrame will
    be returned.
    
    Examples
    --------
    >>> a = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
    ...               "bar", "bar", "foo", "foo", "foo"], dtype=object)
    >>> b = np.array(["one", "one", "one", "two", "one", "one",
    ...               "one", "two", "two", "two", "one"], dtype=object)
    >>> c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
    ...               "shiny", "dull", "shiny", "shiny", "shiny"],
    ...              dtype=object)
    >>> pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
    b   one        two
    c   dull shiny dull shiny
    a
    bar    1     2    1     0
    foo    2     2    1     2
    
    Here 'c' and 'f' are not represented in the data and will not be
    shown in the output because dropna is True by default. Set
    dropna=False to preserve categories with no data.
    
    >>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
    >>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
    >>> pd.crosstab(foo, bar)
    col_0  d  e
    row_0
    a      1  0
    b      0  1
    >>> pd.crosstab(foo, bar, dropna=False)
    col_0  d  e  f
    row_0
    a      1  0  0
    b      0  1  0
    c      0  0  0
pd.crosstab(index=data["Gender"],columns=data["School"],normalize=True,margins=True,margins_name="合计")
School       S_1       S_2      合计
Gender
F       0.228571  0.314286  0.542857
M       0.200000  0.257143  0.457143
合计    0.428571  0.571429  1.000000
pd.crosstab(index=data["Gender"],columns=data["School"],values=np.random.randint(1,20,data.shape[0]),aggfunc="min",normalize=True,margins=True,margins_name="合计")

School       S_1       S_2  合计
Gender
F       0.153846  0.076923  0.25
M       0.230769  0.538462  0.75
合计    0.666667  0.333333  1.00
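Besides normalize=True (divide by the grand total), normalize also accepts 'index' and 'columns' to normalize per row or per column. A sketch on a toy frame (hypothetical data):

```python
import pandas as pd

# Toy stand-in for data/table.csv
df = pd.DataFrame({"Gender": ["F", "F", "M", "M", "M"],
                   "School": ["S_1", "S_2", "S_1", "S_1", "S_2"]})

# 'index': each row sums to 1; 'columns': each column sums to 1
by_row = pd.crosstab(df["Gender"], df["School"], normalize="index")
by_col = pd.crosstab(df["Gender"], df["School"], normalize="columns")
print(by_row)
print(by_col)
```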
np.random.randint(1,20,data.shape[0])
array([11, 18,  2,  4,  5,  8,  7, 14, 17, 16, 10,  7,  4,  2, 13,  3, 18,
       11, 14, 16,  6, 10, 17, 15,  2, 10, 13,  5, 14, 17,  5, 13,  7,  5,
        5])

II. Other reshaping methods

1. melt

The melt function can be thought of as the inverse of pivot: it compresses unstacked data into stacked form, turning a "wide" DataFrame into a "long" one.
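A minimal wide-to-long round trip on a toy frame (the ID/F/M names mirror the example below, but the values here are made up) shows the inverse relationship with pivot:

```python
import pandas as pd

wide = pd.DataFrame({"ID": [1101, 1102], "F": [32.5, 80.4], "M": [34.0, 87.2]})

# melt: wide -> long, stacking the F/M columns into one 'Gender' column
long = wide.melt(id_vars="ID", value_vars=["F", "M"],
                 var_name="Gender", value_name="Math")

# pivot: long -> wide again
back = long.pivot(index="ID", columns="Gender", values="Math").reset_index()
back.columns.name = None
print(back.equals(wide))  # → True
```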

data2 = data[["ID","Gender","Math"]]
pivoted = data2.pivot(index="ID",columns="Gender",values="Math")
pivoted
Gender      F     M
ID
1101      NaN  34.0
1102     32.5   NaN
1103      NaN  87.2
1104     80.4   NaN
1105     84.8   NaN
1201      NaN  97.0
1202     63.5   NaN
1203      NaN  58.8
1204     33.8   NaN
1205     68.4   NaN
1301      NaN  31.5
1302     87.7   NaN
1303      NaN  49.7
1304      NaN  85.2
1305     61.7   NaN
2101      NaN  83.3
2102     50.6   NaN
2103      NaN  52.5
2104     72.2   NaN
2105      NaN  34.2
2201      NaN  39.1
2202     68.5   NaN
2203      NaN  73.8
2204      NaN  47.2
2205     85.4   NaN
2301     72.3   NaN
2302      NaN  32.7
2303     65.9   NaN
2304     95.5   NaN
2305      NaN  48.9
2401     45.3   NaN
2402      NaN  48.7
2403     59.7   NaN
2404     67.7   NaN
2405     47.6   NaN

In melt, id_vars gives the columns to keep as identifiers, and value_vars gives the set of columns to stack

result = pivoted.reset_index().melt(id_vars="ID",value_vars=["F","M"],value_name="Math").dropna().set_index("ID").sort_index()
result
     Gender  Math
ID
1101      M  34.0
1102      F  32.5
1103      M  87.2
1104      F  80.4
1105      F  84.8
1201      M  97.0
1202      F  63.5
1203      M  58.8
1204      F  33.8
1205      F  68.4
1301      M  31.5
1302      F  87.7
1303      M  49.7
1304      M  85.2
1305      F  61.7
2101      M  83.3
2102      F  50.6
2103      M  52.5
2104      F  72.2
2105      M  34.2
2201      M  39.1
2202      F  68.5
2203      M  73.8
2204      M  47.2
2205      F  85.4
2301      F  72.3
2302      M  32.7
2303      F  65.9
2304      F  95.5
2305      M  48.9
2401      F  45.3
2402      M  48.7
2403      F  59.7
2404      F  67.7
2405      F  47.6
result.equals(data2.set_index("ID"))
True

2. Stacking and unstacking

(1) stack: the most basic reshaping function, with only two parameters: level and dropna
data_s = pd.pivot_table(data,index=["Class","ID"],columns="Gender",values=["Height","Weight"])
data_s.head()
            Height        Weight
Gender           F      M      F     M
Class ID
C_1   1101     NaN  173.0    NaN  63.0
      1102   192.0    NaN   73.0   NaN
      1103     NaN  186.0    NaN  82.0
      1104   167.0    NaN   81.0   NaN
      1105   159.0    NaN   64.0   NaN
data_s.groupby("Class").head(2)
            Height        Weight
Gender           F      M      F     M
Class ID
C_1   1101     NaN  173.0    NaN  63.0
      1102   192.0    NaN   73.0   NaN
C_2   1201     NaN  188.0    NaN  68.0
      1202   176.0    NaN   94.0   NaN
C_3   1301     NaN  161.0    NaN  68.0
      1302   175.0    NaN   57.0   NaN
C_4   2401   192.0    NaN   62.0   NaN
      2402     NaN  166.0    NaN  82.0
data_stacked = data_s.stack()
data_stacked.groupby("Class").head(2)
                   Height  Weight
Class ID   Gender
C_1   1101 M        173.0    63.0
      1102 F        192.0    73.0
C_2   1201 M        188.0    68.0
      1202 F        176.0    94.0
C_3   1301 M        161.0    68.0
      1302 F        175.0    57.0
C_4   2401 F        192.0    62.0
      2402 M        166.0    82.0

stack can be seen as moving a horizontal (column) index down into the vertical (row) index, so its effect is similar to melt; the level parameter specifies which column level to move (or which levels, passed as a list)
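A sketch of the level parameter on a toy frame with two column levels (hypothetical values):

```python
import pandas as pd

cols = pd.MultiIndex.from_product([["Height", "Weight"], ["F", "M"]],
                                  names=[None, "Gender"])
df = pd.DataFrame([[160.0, 170.0, 50.0, 60.0],
                   [165.0, 175.0, 55.0, 65.0]],
                  index=["a", "b"], columns=cols)

inner = df.stack()         # default: the innermost column level (Gender) moves down
outer = df.stack(level=0)  # level=0: the outer Height/Weight level moves instead
print(inner.columns.tolist())  # ['Height', 'Weight']
print(outer.columns.tolist())  # ['F', 'M']
```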

data_s
            Height        Weight
Gender           F      M      F      M
Class ID
C_1   1101     NaN  173.0    NaN   63.0
      1102   192.0    NaN   73.0    NaN
      1103     NaN  186.0    NaN   82.0
      1104   167.0    NaN   81.0    NaN
      1105   159.0    NaN   64.0    NaN
      2101     NaN  174.0    NaN   84.0
      2102   161.0    NaN   61.0    NaN
      2103     NaN  157.0    NaN   61.0
      2104   159.0    NaN   97.0    NaN
      2105     NaN  170.0    NaN   81.0
C_2   1201     NaN  188.0    NaN   68.0
      1202   176.0    NaN   94.0    NaN
      1203     NaN  160.0    NaN   53.0
      1204   162.0    NaN   63.0    NaN
      1205   167.0    NaN   63.0    NaN
      2201     NaN  193.0    NaN  100.0
      2202   194.0    NaN   77.0    NaN
      2203     NaN  155.0    NaN   91.0
      2204     NaN  175.0    NaN   74.0
      2205   183.0    NaN   76.0    NaN
C_3   1301     NaN  161.0    NaN   68.0
      1302   175.0    NaN   57.0    NaN
      1303     NaN  188.0    NaN   82.0
      1304     NaN  195.0    NaN   70.0
      1305   187.0    NaN   69.0    NaN
      2301   157.0    NaN   78.0    NaN
      2302     NaN  171.0    NaN   88.0
      2303   190.0    NaN   99.0    NaN
      2304   164.0    NaN   81.0    NaN
      2305     NaN  187.0    NaN   73.0
C_4   2401   192.0    NaN   62.0    NaN
      2402     NaN  166.0    NaN   82.0
      2403   158.0    NaN   60.0    NaN
      2404   160.0    NaN   84.0    NaN
      2405   193.0    NaN   54.0    NaN
(2) unstack: the inverse of stack, functionally similar to pivot_table
result2 = data_stacked.unstack().sort_index(axis=1)  # unstack moves the innermost row level (Gender) back to the columns
result2.equals(data_s)
True
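The round trip works here, but it is not guaranteed in general: stack(dropna=True) silently drops all-NaN entries, which unstack cannot recreate (compare Question 4 below). A toy sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

# Row 1103 is entirely NaN: stack() drops those entries (dropna=True by default),
# so unstack() cannot bring the row back
df = pd.DataFrame({"F": [np.nan, 32.5, np.nan], "M": [34.0, np.nan, np.nan]},
                  index=pd.Index([1101, 1102, 1103], name="ID"))
round_trip = df.stack().unstack()
print(df.shape, round_trip.shape)  # (3, 2) (2, 2) -- row 1103 is lost
```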

III. Dummy variables and factorization

1. Dummy variables

This section mainly introduces the get_dummies function, whose main purpose is one-hot encoding:
data.head()
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+
data_d = data[["Class","Gender","Weight"]]
data_d.head()
  Class Gender  Weight
0   C_1      M      63
1   C_1      F      73
2   C_1      M      82
3   C_1      F      81
4   C_1      F      64
pd.get_dummies(data_d[["Class","Gender"]].join(data_d["Weight"])).head()
   Weight  Class_C_1  Class_C_2  Class_C_3  Class_C_4  Gender_F  Gender_M
0      63          1          0          0          0         0         1
1      73          1          0          0          0         1         0
2      82          1          0          0          0         0         1
3      81          1          0          0          0         1         0
4      64          1          0          0          0         1         0
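Two options worth knowing, sketched on a toy Series (hypothetical values): prefix controls the column-name prefix, and drop_first=True keeps k-1 of the k indicator columns, which avoids perfect collinearity in regression models:

```python
import pandas as pd

s = pd.Series(["M", "F", "M"], name="Gender")

full = pd.get_dummies(s, prefix="Gender")
reduced = pd.get_dummies(s, prefix="Gender", drop_first=True)
print(full.columns.tolist())     # ['Gender_F', 'Gender_M']
print(reduced.columns.tolist())  # ['Gender_M']
```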

2. The factorize method

This method is mainly used for integer encoding of categorical values; missing values are coded as -1. The sort parameter controls whether codes are assigned after sorting the unique values
codes,uniques = pd.factorize(["B",None,"A","C","B","D","F","A"])
codes
array([ 0, -1,  1,  2,  0,  3,  4,  1], dtype=int64)
uniques
array(['B', 'A', 'C', 'D', 'F'], dtype=object)
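A sketch of the sort parameter mentioned above: with sort=True the codes follow the sorted order of the unique values rather than first-appearance order, and missing values are still coded -1:

```python
import pandas as pd

codes, uniques = pd.factorize(["B", None, "A", "C", "B"], sort=True)
print(codes)    # [ 1 -1  0  2  1]
print(uniques)  # ['A' 'B' 'C']
```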

IV. Questions and exercises

1. Questions

[Question 1] Many reshaping functions were introduced above, such as melt/crosstab/pivot/pivot_table/stack/unstack; summarize the characteristic use cases of each.
[Question 2] What is the relationship between reshaping functions and multi-level indexes? Which reshaping functions change the number of index levels, and how exactly?
[Question 3] Give an example of a dummy-variable technique other than the one covered above.
[Question 4] Is calling unstack immediately after stack guaranteed to reproduce the original table exactly?
[Question 5] Three functions for pivot tables were covered; use each of them to accomplish the same (self-chosen) task and compare which is fastest.
[Question 6] Given that melt provides the functionality of stack, why design a stack function at all?

2. Exercises

[Exercise 1] Continue working with the drug dataset from the previous chapter:
drug = pd.read_csv("data/Drugs.csv")
drug
       YYYY State           COUNTY  SubstanceName  DrugReports
0      2010    VA         ACCOMACK   Propoxyphene            1
1      2010    OH            ADAMS       Morphine            9
2      2010    PA            ADAMS      Methadone            2
3      2010    VA  ALEXANDRIA CITY         Heroin            5
4      2010    PA        ALLEGHENY  Hydromorphone            5
...     ...   ...              ...            ...          ...
24057  2017    VA            WYTHE        Codeine            1
24058  2017    VA            WYTHE    Hydrocodone           19
24059  2017    VA            WYTHE       Tramadol            5
24060  2017    PA             YORK           ANPP            1
24061  2017    VA             YORK         Heroin           48

24062 rows × 5 columns

(a) Transform the table into the following shape: each row should show how each drug changed in each region from 2010 to 2017, with the first three columns sorted:

(figure: picture/drug_pic.png)

(b) Restore the result of (a) back to the original table, and use the equals function to check that the initial table matches the new result (returns True)
drug = drug.sort_values(["State","COUNTY","SubstanceName"]).reset_index()

drug

result = pd.pivot_table(drug,index=["State","COUNTY","SubstanceName"],columns="YYYY",values="DrugReports",aggfunc="sum",fill_value="-").reset_index()
result
YYYY State   COUNTY  SubstanceName 2010 2011 2012 2013 2014 2015 2016 2017
0       KY    ADAIR  Buprenorphine    -    3    5    4   27    5    7   10
1       KY    ADAIR        Codeine    -    -    1    -    -    -    -    1
2       KY    ADAIR       Fentanyl    -    -    1    -    -    -    -    -
3       KY    ADAIR         Heroin    -    -    1    2    -    1    -    2
4       KY    ADAIR    Hydrocodone    6    9   10   10    9    7   11    3
...    ...      ...            ...  ...  ...  ...  ...  ...  ...  ...  ...
6209    WV     WOOD      Oxycodone    6    4   24    7    7   11    7    1
6210    WV     WOOD       Tramadol    -    -    -    -    1    -    4    3
6211    WV  WYOMING  Buprenorphine    -    1    1    1    -    -    -    1
6212    WV  WYOMING    Hydrocodone    1    5    -    -    1    -    1    -
6213    WV  WYOMING      Oxycodone    5    4   14   12    5    -    -    -

6214 rows × 11 columns

result.melt(id_vars=["State","COUNTY","SubstanceName"],value_vars=[2010,2011,2012,2013,2014,2015,2016,2017],value_name="DrugReports")
      State   COUNTY  SubstanceName  YYYY DrugReports
0        KY    ADAIR  Buprenorphine  2010           -
1        KY    ADAIR        Codeine  2010           -
2        KY    ADAIR       Fentanyl  2010           -
3        KY    ADAIR         Heroin  2010           -
4        KY    ADAIR    Hydrocodone  2010           6
...     ...      ...            ...   ...         ...
49707    WV     WOOD      Oxycodone  2017           1
49708    WV     WOOD       Tramadol  2017           3
49709    WV  WYOMING  Buprenorphine  2017           1
49710    WV  WYOMING    Hydrocodone  2017           -
49711    WV  WYOMING      Oxycodone  2017           -

49712 rows × 5 columns

drug.head()
   index  YYYY State COUNTY  SubstanceName  DrugReports
0   2731  2011    KY  ADAIR  Buprenorphine            3
1   5319  2012    KY  ADAIR  Buprenorphine            5
2   8782  2013    KY  ADAIR  Buprenorphine            4
3  12163  2014    KY  ADAIR  Buprenorphine           27
4  13645  2015    KY  ADAIR  Buprenorphine            5
result = pd.DataFrame()
for name, group in drug.groupby(["State","COUNTY"]):
    # build one wide block per (State, COUNTY) group, then concatenate them
    group1 = pd.crosstab(index=group["SubstanceName"],columns=group["YYYY"],values=group["DrugReports"],aggfunc="sum").fillna("-").sort_index()
    group1["State"] = name[0]
    group1["COUNTY"] = name[1]
    result = pd.concat([group1,result])
result
                2010 2011 2012 2013 2014 2016 2017 State   COUNTY 2015
SubstanceName
Buprenorphine      -    1    1    1    -    -    1    WV  WYOMING  NaN
Hydrocodone        1    5    -    -    1    1    -    WV  WYOMING  NaN
Oxycodone          5    4   14   12    5    -    -    WV  WYOMING  NaN
Acetyl fentanyl    -    -    -    -    -    -    1    WV     WOOD   20
Acryl fentanyl     -    -    -    -    -    -    9    WV     WOOD    -
...              ...  ...  ...  ...  ...  ...  ...   ...      ...  ...
Hydromorphone      -    -    1    -    -    1    -    KY    ADAIR    -
Methadone          1    -    1    -    -    -    -    KY    ADAIR    -
Morphine           -    2    4    -    -    1    -    KY    ADAIR    1
Oxycodone          -    4    1    1    9    -    1    KY    ADAIR    2
Tramadol           -    1    -    -    -    -    -    KY    ADAIR    -

6214 rows × 10 columns

(b) Restore the result of (a) back to the original table, and use the equals function to check that the initial table matches the new result (returns True)
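A sketch of the idea behind (b) on a tiny stand-in table (the real Drugs.csv is not loaded here, so the counts are made up): melt the wide table back, drop the '-' placeholders that fill_value introduced, restore the dtypes, and only then can equals return True:

```python
import pandas as pd

# Tiny stand-in for the drug table (hypothetical counts)
drug = pd.DataFrame({"YYYY": [2010, 2011, 2010],
                     "State": ["KY", "KY", "VA"],
                     "COUNTY": ["ADAIR", "ADAIR", "WYTHE"],
                     "SubstanceName": ["Heroin", "Heroin", "Codeine"],
                     "DrugReports": [2, 3, 1]})

# (a)-style wide table
wide = pd.pivot_table(drug, index=["State", "COUNTY", "SubstanceName"],
                      columns="YYYY", values="DrugReports",
                      aggfunc="sum", fill_value="-").reset_index()

# (b): melt back, drop fillers, fix dtypes, realign columns
long = (wide.melt(id_vars=["State", "COUNTY", "SubstanceName"],
                  var_name="YYYY", value_name="DrugReports")
            .query("DrugReports != '-'"))
long["YYYY"] = long["YYYY"].astype("int64")
long["DrugReports"] = long["DrugReports"].astype("int64")
restored = (long.reindex(columns=drug.columns)
                .sort_values(["YYYY", "State"])
                .reset_index(drop=True))
print(restored.equals(drug.sort_values(["YYYY", "State"]).reset_index(drop=True)))
```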

[Exercise 2] Given a dataset on earthquakes in a certain region, solve the following:
(a) Transform the table into the following shape: spread the direction column out, and compress the distance, depth, and intensity attributes:

(figure: picture/earthquake_pic.png)

(b) Restore the result of (a) back to the original table, and use the equals function to check that the initial table matches the new result (returns True)
