pandas 第十二期组队-pandas基础

import pandas as pd 
import numpy as np
pd.__version__
'1.0.3'

文件的读写#

csv格式

data=pd.read_csv('E:\jupyter Notebook\天池比赛\joyful-pandas-master\data\\table.csv')
data.head(20)
SchoolClassIDGenderAddressHeightWeightMathPhysics
0S_1C_11101Mstreet_11736334.0A+
1S_1C_11102Fstreet_21927332.5B+
2S_1C_11103Mstreet_21868287.2B+
3S_1C_11104Fstreet_21678180.4B-
4S_1C_11105Fstreet_41596484.8B+
5S_1C_21201Mstreet_51886897.0A-
6S_1C_21202Fstreet_41769463.5B-
7S_1C_21203Mstreet_61605358.8A+
8S_1C_21204Fstreet_51626333.8B
9S_1C_21205Fstreet_61676368.4B-
10S_1C_31301Mstreet_41616831.5B+
11S_1C_31302Fstreet_11755787.7A-
12S_1C_31303Mstreet_71888249.7B
13S_1C_31304Mstreet_21957085.2A
14S_1C_31305Fstreet_51876961.7B-
15S_2C_12101Mstreet_71748483.3C
16S_2C_12102Fstreet_61616150.6B+
17S_2C_12103Mstreet_41576152.5B-
18S_2C_12104Fstreet_51599772.2B+
19S_2C_12105Mstreet_41708134.2A

txt格式

data_txt=pd.read_table('E:\jupyter Notebook\天池比赛\joyful-pandas-master\data\\table.txt')
#读取文件的时候,默认是以,作为分隔符,可以指定其他分隔符
data_txt.head(20)
col1col2col3col4
02a1.4apple
13b3.4banana
26c2.5orange
35d3.2lemon

xls或者xlsx格式

data_excel=pd.read_excel('E:\jupyter Notebook\天池比赛\joyful-pandas-master\data\\table.xlsx')
data_excel.head(20)
SchoolClassIDGenderAddressHeightWeightMathPhysics
0S_1C_11101Mstreet_11736334.0A+
1S_1C_11102Fstreet_21927332.5B+
2S_1C_11103Mstreet_21868287.2B+
3S_1C_11104Fstreet_21678180.4B-
4S_1C_11105Fstreet_41596484.8B+
5S_1C_21201Mstreet_51886897.0A-
6S_1C_21202Fstreet_41769463.5B-
7S_1C_21203Mstreet_61605358.8A+
8S_1C_21204Fstreet_51626333.8B
9S_1C_21205Fstreet_61676368.4B-
10S_1C_31301Mstreet_41616831.5B+
11S_1C_31302Fstreet_11755787.7A-
12S_1C_31303Mstreet_71888249.7B
13S_1C_31304Mstreet_21957085.2A
14S_1C_31305Fstreet_51876961.7B-
15S_2C_12101Mstreet_71748483.3C
16S_2C_12102Fstreet_61616150.6B+
17S_2C_12103Mstreet_41576152.5B-
18S_2C_12104Fstreet_51599772.2B+
19S_2C_12105Mstreet_41708134.2A

写入

csv格式

data.to_csv('E:\jupyter Notebook\天池比赛\joyful-pandas-master\data\\table.csv')

其他的不过多展示了

数据的基本结构

Series

创建一个Series

  • 对于一个Series,其中最常用的属性为值(values),索引(index),名字(name),类型(type)
s=pd.Series(np.random.randn(5),index=['a','b','c','d','e'],name='this is Series',dtype='float64')
s
a   -1.576202
b    1.070319
c    0.668490
d    1.760701
e    0.942655
Name: this is Series, dtype: float64

访问Series属性

s.values
array([-1.57620183,  1.07031892,  0.66849049,  1.76070095,  0.94265474])

都会就不过多敲代码了

取出元素

#根据索引来取出
s['a']
-1.5762018254215895

调用方法

print([attr for attr in dir(s) if not attr.startswith('_')])
#字段名称打印?
['T', 'a', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'append', 'apply', 'argmax', 'argmin', 'argsort', 'array', 'asfreq', 'asof', 'astype', 'at', 'at_time', 'attrs', 'autocorr', 'axes', 'b', 'between', 'between_time', 'bfill', 'bool', 'c', 'clip', 'combine', 'combine_first', 'convert_dtypes', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'd', 'describe', 'diff', 'div', 'divide', 'divmod', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtype', 'dtypes', 'duplicated', 'e', 'empty', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'hasnans', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'interpolate', 'is_monotonic', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'isin', 'isna', 'isnull', 'item', 'items', 'iteritems', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lt', 'mad', 'map', 'mask', 'max', 'mean', 'median', 'memory_usage', 'min', 'mod', 'mode', 'mul', 'multiply', 'name', 'nbytes', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pct_change', 'pipe', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'radd', 'rank', 'ravel', 'rdiv', 'rdivmod', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'repeat', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'searchsorted', 'sem', 'set_axis', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort_index', 'sort_values', 'squeeze', 'std', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_frame', 'to_hdf', 'to_json', 'to_latex', 'to_list', 'to_markdown', 'to_numpy', 'to_period', 'to_pickle', 'to_sql', 'to_string', 'to_timestamp', 'to_xarray', 'transform', 'transpose', 'truediv', 'truncate', 'tshift', 'tz_convert', 'tz_localize', 'unique', 'unstack', 'update', 'value_counts', 'values', 'var', 'view', 'where', 'xs']

DataFrame

创建一个DF

df=pd.DataFrame({'col1':list('abcde'),'col2':range(5,10),'col3':[1.3,2.5,3.6,4.6,5.8]},index=list('12345'))
df
col1col2col3
1a51.3
2b62.5
3c73.6
4d84.6
5e95.8

从DF中取出一列为Series

df['col1']
1    a
2    b
3    c
4    d
5    e
Name: col1, dtype: object
type(df)
pandas.core.frame.DataFrame
type(df['col1'])
pandas.core.series.Series

索引对齐

df1=pd.DataFrame({"A":[1,2,3]},index=[1,2,3])
df2=pd.DataFrame({'A':[1,2,3]},index=[3,1,2])
df1-df2
#因为存在索引对齐,所以结果不是0
A
1-1
2-1
32

列的删除与添加

  • 可以使用drop函数,del函数与pop函数
df.drop(index='5',columns='col1')
col2col3
151.3
262.5
373.6
484.6
#pop方法直接在DF上操作,且返回被删除的列,与Python中的pop函数类似
df['col1']=[1,2,3,4,5]
df.pop('col1')
1    1
2    2
3    3
4    4
5    5
Name: col1, dtype: int64
df
col2col3
151.3
262.5
373.6
484.6
595.8
#可以直接增加,也可通过assign可以增加新列
df1['B']=list('abc')
df1
AB
11a
22b
33c
df1.assign(C=pd.Series(list('def')))
#assign方法不会对DF进行修改,所以要重新给予新的变量
ABC
11ae
22bf
33cNaN

根据不同类型选择列

df.select_dtypes(include=['number']).head()
col2col3
151.3
262.5
373.6
484.6
595.8

将series转换成DF

s=df.mean()
s.name='to_DF'
s
col2    7.00
col3    3.56
Name: to_DF, dtype: float64
s.to_frame()
to_DF
col27.00
col33.56
#使用T进行转置
s.to_frame().T
col2col3
to_DF7.03.56

常用的基本函数

df=pd.read_csv('E:\jupyter Notebook\天池比赛\joyful-pandas-master\data\\table.csv')
df=df.drop(columns='Unnamed: 0')
df
SchoolClassIDGenderAddressHeightWeightMathPhysics
0S_1C_11101Mstreet_11736334.0A+
1S_1C_11102Fstreet_21927332.5B+
2S_1C_11103Mstreet_21868287.2B+
3S_1C_11104Fstreet_21678180.4B-
4S_1C_11105Fstreet_41596484.8B+
5S_1C_21201Mstreet_51886897.0A-
6S_1C_21202Fstreet_41769463.5B-
7S_1C_21203Mstreet_61605358.8A+
8S_1C_21204Fstreet_51626333.8B
9S_1C_21205Fstreet_61676368.4B-
10S_1C_31301Mstreet_41616831.5B+
11S_1C_31302Fstreet_11755787.7A-
12S_1C_31303Mstreet_71888249.7B
13S_1C_31304Mstreet_21957085.2A
14S_1C_31305Fstreet_51876961.7B-
15S_2C_12101Mstreet_71748483.3C
16S_2C_12102Fstreet_61616150.6B+
17S_2C_12103Mstreet_41576152.5B-
18S_2C_12104Fstreet_51599772.2B+
19S_2C_12105Mstreet_41708134.2A
20S_2C_22201Mstreet_519310039.1B
21S_2C_22202Fstreet_71947768.5B+
22S_2C_22203Mstreet_41559173.8A+
23S_2C_22204Mstreet_11757447.2B-
24S_2C_22205Fstreet_71837685.4B
25S_2C_32301Fstreet_41577872.3B+
26S_2C_32302Mstreet_51718832.7A
27S_2C_32303Fstreet_71909965.9C
28S_2C_32304Fstreet_61648195.5A-
29S_2C_32305Mstreet_41877348.9B
30S_2C_42401Fstreet_21926245.3A
31S_2C_42402Mstreet_71668248.7B
32S_2C_42403Fstreet_61586059.7B+
33S_2C_42404Fstreet_21608467.7B
34S_2C_42405Fstreet_61935447.6B

unique和nunique

nunique显示有多少个唯一值 —数量

df['Physics'].nunique()
7

unique显示所有的唯一值 —具体值

df['Physics'].unique()
array(['A+', 'B+', 'B-', 'A-', 'B', 'A', 'C'], dtype=object)

count and value_counts

df['Physics'].count()
#注意,返回的是非缺失值的元素个数!
35
#value_counts返回每个元素有多少个
df['Physics'].value_counts()
B+    9
B     8
B-    6
A     4
A+    3
A-    3
C     2
Name: Physics, dtype: int64

describe and info

#info函数范围有哪些列,有多少非缺失值,每列的类型
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   School   35 non-null     object 
 1   Class    35 non-null     object 
 2   ID       35 non-null     int64  
 3   Gender   35 non-null     object 
 4   Address  35 non-null     object 
 5   Height   35 non-null     int64  
 6   Weight   35 non-null     int64  
 7   Math     35 non-null     float64
 8   Physics  35 non-null     object 
dtypes: float64(1), int64(3), object(5)
memory usage: 2.6+ KB
#describe默认统计数值型数据的各个统计量
df.describe()
IDHeightWeightMath
count35.0000035.00000035.00000035.000000
mean1803.00000174.14285774.65714361.351429
std536.8774113.54109812.89537719.915164
min1101.00000155.00000053.00000031.500000
25%1204.50000161.00000063.00000047.400000
50%2103.00000173.00000074.00000061.700000
75%2301.50000187.50000082.00000077.100000
max2405.00000195.000000100.00000097.000000
#describe函数可以调整不同的分位数
df.describe(percentiles=[.05,.25,.75,.95])#注意是[]
IDHeightWeightMath
count35.0000035.00000035.00000035.000000
mean1803.00000174.14285774.65714361.351429
std536.8774113.54109812.89537719.915164
min1101.00000155.00000053.00000031.500000
5%1102.70000157.00000056.10000032.640000
25%1204.50000161.00000063.00000047.400000
50%2103.00000173.00000074.00000061.700000
75%2301.50000187.50000082.00000077.100000
95%2403.30000193.30000097.60000090.040000
max2405.00000195.000000100.00000097.000000
#对于非数值型也可以使用describe函数
df['Physics'].describe()
count     35
unique     7
top       B+
freq       9
Name: Physics, dtype: object

idxmax and nlargest

# idxmax函数返回最大值,在某些情况下使用,同理类比idxmin,
df['Math'].idxmax()
df['Math'].idxmin()
10
#nlargest函数范围前几个大的元素值,nsmallest功能类似
df['Math'].nlargest()#()可以填写数量
5     97.0
28    95.5
11    87.7
2     87.2
24    85.4
Name: Math, dtype: float64

clip and repalce

#clip是对超过或低于某些数进行截断
df['Math'].clip(33,80).head(30)
0     34.0
1     33.0
2     80.0
3     80.0
4     80.0
5     80.0
6     63.5
7     58.8
8     33.8
9     68.4
10    33.0
11    80.0
12    49.7
13    80.0
14    61.7
15    80.0
16    50.6
17    52.5
18    72.2
19    34.2
20    39.1
21    68.5
22    73.8
23    47.2
24    80.0
25    72.3
26    33.0
27    65.9
28    80.0
29    48.9
Name: Math, dtype: float64
#repalce是对某些值进行替换
df['Address'].head()
0    street_1
1    street_2
2    street_2
3    street_2
4    street_4
Name: Address, dtype: object
df['Address'].repalce(['street_1','street_2'],['one','two']).head()
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-97-4b99be6262e7> in <module>
----> 1 df['Address'].repalce(['street_1','street_2'],['one','two']).head()


C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5272             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5273                 return self[name]
-> 5274             return object.__getattribute__(self, name)
   5275 
   5276     def __setattr__(self, name: str, value) -> None:


AttributeError: 'Series' object has no attribute 'repalce'

apply函数

#对于series,它可以迭代每一列值操作
df['Math'].apply(lambda x:str(x)+'!').head()
0    34.0!
1    32.5!
2    87.2!
3    80.4!
4    84.8!
Name: Math, dtype: object
#对于DF,迭代每一列操作
df.apply(lambda x:x.apply(lambda x:str(x)+'!')).head()
SchoolClassIDGenderAddressHeightWeightMathPhysics
0S_1!C_1!1101!M!street_1!173!63!34.0!A+!
1S_1!C_1!1102!F!street_2!192!73!32.5!B+!
2S_1!C_1!1103!M!street_2!186!82!87.2!B+!
3S_1!C_1!1104!F!street_2!167!81!80.4!B-!
4S_1!C_1!1105!F!street_4!159!64!84.8!B+!
##对于apply的扩展-可以选择某些列单独加
df['Math'].apply(lambda x:str(x)+"!")
0     34.0!
1     32.5!
2     87.2!
3     80.4!
4     84.8!
5     97.0!
6     63.5!
7     58.8!
8     33.8!
9     68.4!
10    31.5!
11    87.7!
12    49.7!
13    85.2!
14    61.7!
15    83.3!
16    50.6!
17    52.5!
18    72.2!
19    34.2!
20    39.1!
21    68.5!
22    73.8!
23    47.2!
24    85.4!
25    72.3!
26    32.7!
27    65.9!
28    95.5!
29    48.9!
30    45.3!
31    48.7!
32    59.7!
33    67.7!
34    47.6!
Name: Math, dtype: object

排序

索引排序

df.set_index('Math').head(20)
#将Math设为索引
SchoolClassIDGenderAddressHeightWeightPhysics
Math
34.0S_1C_11101Mstreet_117363A+
32.5S_1C_11102Fstreet_219273B+
87.2S_1C_11103Mstreet_218682B+
80.4S_1C_11104Fstreet_216781B-
84.8S_1C_11105Fstreet_415964B+
97.0S_1C_21201Mstreet_518868A-
63.5S_1C_21202Fstreet_417694B-
58.8S_1C_21203Mstreet_616053A+
33.8S_1C_21204Fstreet_516263B
68.4S_1C_21205Fstreet_616763B-
31.5S_1C_31301Mstreet_416168B+
87.7S_1C_31302Fstreet_117557A-
49.7S_1C_31303Mstreet_718882B
85.2S_1C_31304Mstreet_219570A
61.7S_1C_31305Fstreet_518769B-
83.3S_2C_12101Mstreet_717484C
50.6S_2C_12102Fstreet_616161B+
52.5S_2C_12103Mstreet_415761B-
72.2S_2C_12104Fstreet_515997B+
34.2S_2C_12105Mstreet_417081A
df.set_index('Math').sort_index().head()
SchoolClassIDGenderAddressHeightWeightPhysics
Math
31.5S_1C_31301Mstreet_416168B+
32.5S_1C_11102Fstreet_219273B+
32.7S_2C_32302Mstreet_517188A
33.8S_1C_21204Fstreet_516263B
34.0S_1C_11101Mstreet_117363A+

值排序

df.sort_values(by='Class').head(20)
SchoolClassIDGenderAddressHeightWeightMathPhysics
0S_1C_11101Mstreet_11736334.0A+
19S_2C_12105Mstreet_41708134.2A
18S_2C_12104Fstreet_51599772.2B+
16S_2C_12102Fstreet_61616150.6B+
15S_2C_12101Mstreet_71748483.3C
17S_2C_12103Mstreet_41576152.5B-
1S_1C_11102Fstreet_21927332.5B+
2S_1C_11103Mstreet_21868287.2B+
3S_1C_11104Fstreet_21678180.4B-
4S_1C_11105Fstreet_41596484.8B+
6S_1C_21202Fstreet_41769463.5B-
24S_2C_22205Fstreet_71837685.4B
23S_2C_22204Mstreet_11757447.2B-
22S_2C_22203Mstreet_41559173.8A+
21S_2C_22202Fstreet_71947768.5B+
5S_1C_21201Mstreet_51886897.0A-
20S_2C_22201Mstreet_519310039.1B
9S_1C_21205Fstreet_61676368.4B-
8S_1C_21204Fstreet_51626333.8B
7S_1C_21203Mstreet_61605358.8A+

多个值排序,即先对第一层排,在第一层相同的情况下,对第二层排序

df.sort_values(by=['Math','Physics']).head(20)
SchoolClassIDGenderAddressHeightWeightMathPhysics
10S_1C_31301Mstreet_41616831.5B+
1S_1C_11102Fstreet_21927332.5B+
26S_2C_32302Mstreet_51718832.7A
8S_1C_21204Fstreet_51626333.8B
0S_1C_11101Mstreet_11736334.0A+
19S_2C_12105Mstreet_41708134.2A
20S_2C_22201Mstreet_519310039.1B
30S_2C_42401Fstreet_21926245.3A
23S_2C_22204Mstreet_11757447.2B-
34S_2C_42405Fstreet_61935447.6B
31S_2C_42402Mstreet_71668248.7B
29S_2C_32305Mstreet_41877348.9B
12S_1C_31303Mstreet_71888249.7B
16S_2C_12102Fstreet_61616150.6B+
17S_2C_12103Mstreet_41576152.5B-
7S_1C_21203Mstreet_61605358.8A+
32S_2C_42403Fstreet_61586059.7B+
14S_1C_31305Fstreet_51876961.7B-
6S_1C_21202Fstreet_41769463.5B-
27S_2C_32303Fstreet_71909965.9C

问题与练习

问题

Series和DataFrame有哪些常见属性和方法?

  • for Series have values,name,type and index
  • for DF index and values

value_counts会统计缺失值吗?

  • NO

与idxmax和nlargest功能相反的是哪两组函数?

  • idxmin and nsmallest

df.mean(axis=1)是什么意思?它与df.mean()的结果一样吗?第一问提到的函数也有axis参数吗?怎么使用?

  • 针对DF进行行均值计算 不一样,前者是行,后者是列

练习

现有一份关于美剧《权力的游戏》剧本的数据集,请解决以下问题:

  • 在所有的数据中,一共出现了多少人物?
  • 以单元格计数(即简单把一个单元格视作一句),谁说了最多的话?
  • 以单词计数,谁说了最多的单词?
data1=pd.read_csv('E:\jupyter Notebook\天池比赛\joyful-pandas-master\data\\Game_of_Thrones_Script.csv')
data1.head(20)
Release DateSeasonEpisodeEpisode TitleNameSentence
02011/4/17Season 1Episode 1Winter is Comingwaymar royceWhat do you expect? They're savages. One lot s...
12011/4/17Season 1Episode 1Winter is ComingwillI've never seen wildlings do a thing like this...
22011/4/17Season 1Episode 1Winter is Comingwaymar royceHow close did you get?
32011/4/17Season 1Episode 1Winter is ComingwillClose as any man would.
42011/4/17Season 1Episode 1Winter is CominggaredWe should head back to the wall.
52011/4/17Season 1Episode 1Winter is ComingroyceDo the dead frighten you?
62011/4/17Season 1Episode 1Winter is CominggaredOur orders were to track the wildlings. We tra...
72011/4/17Season 1Episode 1Winter is ComingroyceYou don't think he'll ask us how they died? Ge...
82011/4/17Season 1Episode 1Winter is ComingwillWhatever did it to them could do it to us. The...
92011/4/17Season 1Episode 1Winter is ComingroyceIt's a good thing we're not children. You want...
102011/4/17Season 1Episode 1Winter is ComingroyceYour dead men seem to have moved camp.
112011/4/17Season 1Episode 1Winter is ComingwillThey were here.
122011/4/17Season 1Episode 1Winter is CominggaredSee where they went.
132011/4/17Season 1Episode 1Winter is ComingroyceWhat is it?
142011/4/17Season 1Episode 1Winter is CominggaredIt's …
152011/4/17Season 1Episode 1Winter is Comingjon snowGo on. Father's watching.
162011/4/17Season 1Episode 1Winter is Comingjon snowAnd your mother.
172011/4/17Season 1Episode 1Winter is Comingsepta mordaneFine work, as always. Well done.
182011/4/17Season 1Episode 1Winter is Comingsansa starkThank you.
192011/4/17Season 1Episode 1Winter is Comingsepta mordaneI love the detail that you've managed to get i...
data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23911 entries, 0 to 23910
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Release Date   23911 non-null  object
 1   Season         23911 non-null  object
 2   Episode        23911 non-null  object
 3   Episode Title  23911 non-null  object
 4   Name           23911 non-null  object
 5   Sentence       23911 non-null  object
dtypes: object(6)
memory usage: 1.1+ MB
data1['Name'].nunique()
#一共有564个人物
564
#针对单元格,使用describe,结果是tyrion lannister,次数是1760
data1['Name'].describe()
count                23911
unique                 564
top       tyrion lannister
freq                  1760
Name: Name, dtype: object

第三个题目不会

现有一份关于科比的投篮数据集,请解决如下问题:

  • 哪种action_type和combined_shot_type的组合是最多的?
  • 在所有被记录的game_id中,遭遇到最多的opponent是一个支?
data2=pd.read_csv('E:\jupyter Notebook\天池比赛\joyful-pandas-master\data\\Kobe_data.csv')
data2
shot_idaction_typecombined_shot_typegame_event_idgame_idlatloc_xloc_ylonminutes_remaining...shot_made_flagshot_typeshot_zone_areashot_zone_basicshot_zone_rangeteam_idteam_namegame_datematchupopponent
01Jump ShotJump Shot102000001233.972316772-118.102810...NaN2PT Field GoalRight Side(R)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
12Jump ShotJump Shot122000001234.0443-1570-118.426810...0.02PT Field GoalLeft Side(L)Mid-Range8-16 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
23Jump ShotJump Shot352000001233.9093-101135-118.37087...1.02PT Field GoalLeft Side Center(LC)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
34Jump ShotJump Shot432000001233.8693138175-118.13186...0.02PT Field GoalRight Side Center(RC)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
45Driving Dunk ShotDunk1552000001234.044300-118.26986...1.02PT Field GoalCenter(C)Restricted AreaLess Than 8 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
..................................................................
3069230693Jump ShotJump Shot3974990008833.9963148-118.26886...0.02PT Field GoalCenter(C)In The Paint (Non-RA)Less Than 8 ft.1610612747Los Angeles Lakers2000/6/19LAL vs. INDIND
3069330694Tip ShotTip Shot3984990008834.044300-118.26986...NaN2PT Field GoalCenter(C)Restricted AreaLess Than 8 ft.1610612747Los Angeles Lakers2000/6/19LAL vs. INDIND
3069430695Running Jump ShotJump Shot4264990008833.8783-134166-118.40383...1.02PT Field GoalLeft Side Center(LC)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/6/19LAL vs. INDIND
3069530696Jump ShotJump Shot4484990008833.777331267-118.23882...0.03PT Field GoalCenter(C)Above the Break 324+ ft.1610612747Los Angeles Lakers2000/6/19LAL vs. INDIND
3069630697Jump ShotJump Shot4714990008833.9723172-118.26880...0.02PT Field GoalCenter(C)In The Paint (Non-RA)Less Than 8 ft.1610612747Los Angeles Lakers2000/6/19LAL vs. INDIND

30697 rows × 25 columns

data2=data2.set_index('shot_id')
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-135-7f5596703652> in <module>
----> 1 data2=data2.set_index('shot_id')
      2 data2.head(20)


C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
   4301 
   4302         if missing:
-> 4303             raise KeyError(f"None of {missing} are in the columns")
   4304 
   4305         if inplace:


KeyError: "None of ['shot_id'] are in the columns"
data2.head(20)
action_typecombined_shot_typegame_event_idgame_idlatloc_xloc_ylonminutes_remainingperiod...shot_made_flagshot_typeshot_zone_areashot_zone_basicshot_zone_rangeteam_idteam_namegame_datematchupopponent
shot_id
1Jump ShotJump Shot102000001233.972316772-118.1028101...NaN2PT Field GoalRight Side(R)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
2Jump ShotJump Shot122000001234.0443-1570-118.4268101...0.02PT Field GoalLeft Side(L)Mid-Range8-16 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
3Jump ShotJump Shot352000001233.9093-101135-118.370871...1.02PT Field GoalLeft Side Center(LC)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
4Jump ShotJump Shot432000001233.8693138175-118.131861...0.02PT Field GoalRight Side Center(RC)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
5Driving Dunk ShotDunk1552000001234.044300-118.269862...1.02PT Field GoalCenter(C)Restricted AreaLess Than 8 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
6Jump ShotJump Shot2442000001234.0553-145-11-118.414893...0.02PT Field GoalLeft Side(L)Mid-Range8-16 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
7Layup ShotLayup2512000001234.044300-118.269883...1.02PT Field GoalCenter(C)Restricted AreaLess Than 8 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
8Jump ShotJump Shot2542000001234.0163128-118.268883...NaN2PT Field GoalCenter(C)Restricted AreaLess Than 8 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
9Jump ShotJump Shot2652000001233.9363-65108-118.334863...1.02PT Field GoalLeft Side(L)In The Paint (Non-RA)8-16 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
10Running Jump ShotJump Shot2942000001233.9193-33125-118.302833...0.02PT Field GoalCenter(C)In The Paint (Non-RA)8-16 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
11Jump ShotJump Shot3092000001233.8063-94238-118.363813...0.03PT Field GoalLeft Side Center(LC)Above the Break 324+ ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPOR
12Jump ShotJump Shot42000001933.9173121127-118.1488111...1.02PT Field GoalRight Side Center(RC)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/11/1LAL vs. UTAUTA
13Running Jump ShotJump Shot272000001933.9343-67110-118.336871...1.02PT Field GoalLeft Side(L)In The Paint (Non-RA)8-16 ft.1610612747Los Angeles Lakers2000/11/1LAL vs. UTAUTA
14Jump ShotJump Shot662000001934.0403-944-118.363821...0.02PT Field GoalLeft Side(L)Mid-Range8-16 ft.1610612747Los Angeles Lakers2000/11/1LAL vs. UTAUTA
15Jump ShotJump Shot802000001933.9973-2347-118.292811...0.02PT Field GoalCenter(C)In The Paint (Non-RA)Less Than 8 ft.1610612747Los Angeles Lakers2000/11/1LAL vs. UTAUTA
16Jump ShotJump Shot862000001933.852362192-118.207801...0.02PT Field GoalCenter(C)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/11/1LAL vs. UTAUTA
17Driving Layup ShotLayup1002000001934.044300-118.269801...NaN2PT Field GoalCenter(C)Restricted AreaLess Than 8 ft.1610612747Los Angeles Lakers2000/11/1LAL vs. UTAUTA
18Jump ShotJump Shot1382000001933.8183-117226-118.386882...1.03PT Field GoalLeft Side Center(LC)Above the Break 324+ ft.1610612747Los Angeles Lakers2000/11/1LAL vs. UTAUTA
19Jump ShotJump Shot2442000001933.9473-13297-118.4018113...0.02PT Field GoalLeft Side Center(LC)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/11/1LAL vs. UTAUTA
20Driving Layup ShotLayup2492000001934.044300-118.2698103...NaN2PT Field GoalCenter(C)Restricted AreaLess Than 8 ft.1610612747Los Angeles Lakers2000/11/1LAL vs. UTAUTA

20 rows × 24 columns

data2.groupby(by=('action_type','combined_shot_type'))
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-138-e68698f5f785> in <module>
----> 1 data2.groupby(by=('action_type','combined_shot_type'))


C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed)
   5808             group_keys=group_keys,
   5809             squeeze=squeeze,
-> 5810             observed=observed,
   5811         )
   5812 


C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated)
    407                 sort=sort,
    408                 observed=observed,
--> 409                 mutated=self.mutated,
    410             )
    411 


C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
    596                 in_axis, name, level, gpr = False, None, gpr, None
    597             else:
--> 598                 raise KeyError(gpr)
    599         elif isinstance(gpr, Grouper) and gpr.key is not None:
    600             # Add key to exclusions


KeyError: ('action_type', 'combined_shot_type')
data2_join=data2.assign(join=data2['combined_shot_type']+data2['action_type'])
data2_join.head()
action_typecombined_shot_typegame_event_idgame_idlatloc_xloc_ylonminutes_remainingperiod...shot_typeshot_zone_areashot_zone_basicshot_zone_rangeteam_idteam_namegame_datematchupopponentjoin
shot_id
1Jump ShotJump Shot102000001233.972316772-118.1028101...2PT Field GoalRight Side(R)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPORJump ShotJump Shot
2Jump ShotJump Shot122000001234.0443-1570-118.4268101...2PT Field GoalLeft Side(L)Mid-Range8-16 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPORJump ShotJump Shot
3Jump ShotJump Shot352000001233.9093-101135-118.370871...2PT Field GoalLeft Side Center(LC)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPORJump ShotJump Shot
4Jump ShotJump Shot432000001233.8693138175-118.131861...2PT Field GoalRight Side Center(RC)Mid-Range16-24 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPORJump ShotJump Shot
5Driving Dunk ShotDunk1552000001234.044300-118.269862...2PT Field GoalCenter(C)Restricted AreaLess Than 8 ft.1610612747Los Angeles Lakers2000/10/31LAL @ PORPORDunkDriving Dunk Shot

5 rows × 25 columns

data2_join1=data2['combined_shot_type']+'+'+data2['action_type']
data2_join1.head(100)
shot_id
1         Jump Shot+Jump Shot
2         Jump Shot+Jump Shot
3         Jump Shot+Jump Shot
4         Jump Shot+Jump Shot
5      Dunk+Driving Dunk Shot
                ...          
96        Jump Shot+Jump Shot
97        Jump Shot+Jump Shot
98        Jump Shot+Jump Shot
99        Jump Shot+Jump Shot
100       Jump Shot+Jump Shot
Length: 100, dtype: object
data2_join1.nunique()
57
data2_join1.value_counts()
#最多的是Jump Shot+Jump Shot
Jump Shot+Jump Shot                             18880
Layup+Layup Shot                                 2567
Layup+Driving Layup Shot                         1978
Jump Shot+Turnaround Jump Shot                   1057
Jump Shot+Fadeaway Jump Shot                     1048
Jump Shot+Running Jump Shot                       926
Jump Shot+Pullup Jump shot                        476
Jump Shot+Turnaround Fadeaway shot                439
Dunk+Slam Dunk Shot                               411
Layup+Reverse Layup Shot                          395
Jump Shot+Jump Bank Shot                          333
Dunk+Driving Dunk Shot                            310
Dunk+Dunk Shot                                    262
Tip Shot+Tip Shot                                 182
Dunk+Alley Oop Dunk Shot                          122
Jump Shot+Step Back Jump shot                     118
Jump Shot+Floating Jump shot                      114
Layup+Driving Reverse Layup Shot                   97
Hook Shot+Hook Shot                                84
Layup+Driving Finger Roll Shot                     82
Layup+Alley Oop Layup shot                         80
Dunk+Reverse Dunk Shot                             75
Layup+Running Layup Shot                           72
Bank Shot+Turnaround Bank shot                     71
Layup+Driving Finger Roll Layup Shot               69
Bank Shot+Running Bank shot                        48
Dunk+Driving Slam Dunk Shot                        48
Hook Shot+Running Hook Shot                        41
Layup+Finger Roll Layup Shot                       33
Jump Shot+Fadeaway Bank shot                       31
Layup+Finger Roll Shot                             28
Jump Shot+Driving Jump shot                        28
Jump Shot+Jump Hook Shot                           24
Dunk+Running Dunk Shot                             19
Dunk+Reverse Slam Dunk Shot                        16
Layup+Putback Layup Shot                           15
Dunk+Follow Up Dunk Shot                           15
Hook Shot+Turnaround Hook Shot                     14
Hook Shot+Driving Hook Shot                        14
Bank Shot+Pullup Bank shot                         12
Layup+Running Reverse Layup Shot                   11
Layup+Running Finger Roll Layup Shot                6
Layup+Cutting Layup Shot                            6
Bank Shot+Driving Bank shot                         5
Bank Shot+Hook Bank Shot                            5
Dunk+Putback Dunk Shot                              5
Jump Shot+Driving Floating Jump Shot                5
Layup+Running Finger Roll Shot                      4
Jump Shot+Running Pull-Up Jump Shot                 4
Layup+Turnaround Finger Roll Shot                   2
Tip Shot+Running Tip Shot                           2
Layup+Tip Layup Shot                                2
Dunk+Putback Slam Dunk Shot                         2
Dunk+Running Slam Dunk Shot                         1
Jump Shot+Turnaround Fadeaway Bank Jump Shot        1
Jump Shot+Driving Floating Bank Jump Shot           1
Layup+Cutting Finger Roll Layup Shot                1
dtype: int64
data2['opponent'].value_counts()
#遭遇最多的是SAS
SAS    1978
PHX    1781
HOU    1666
SAC    1643
DEN    1642
POR    1539
UTA    1490
MIN    1474
GSW    1356
LAC    1285
DAL    1199
MEM    1030
BOS     946
SEA     828
IND     761
PHI     720
ORL     719
DET     715
OKC     677
TOR     664
NYK     657
MIA     627
CHA     620
CLE     619
CHI     610
WAS     600
MIL     586
NOH     581
NJN     520
ATL     519
NOP     344
VAN     246
BKN      55
Name: opponent, dtype: int64

经验总结

本次是第二次参加组队学习,这次是针对pandas的学习
在这个练习中暂时没有遇到什么问题,除了在课后习题中第一题的第三问做不出来,之后等学习到那个时候在解决。其次是自己的解决方案和参考答案方法不同,但结果相同

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值