TASK2 Indexing

Youdef

Category: Machine Learning and Data Analysis. Tags: data analysis, numpy, python

Article link: https://blog.csdn.net/youdef/article/details/105714537

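A shaky foundation makes everything wobble! I was somewhat muddled working through today's material, mainly because I was applying indexing without really understanding it.
I'm checking in first and will reorganize these notes afterwards.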

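I. Single-level indexing

1. The loc method, the iloc method, and the [] operator

Use loc for rows (label-based indexing), [] for columns, iloc for positions, bool/query for conditions, and at/iat for scalars.

Covered: single-row, multi-row, single-column, multi-column, combined, function-based, and bool indexing.

(b) The iloc method (note that, unlike loc, a slice excludes its right endpoint)

iloc accepts only positional arguments: integers, integer lists, integer slices, and bool lists. A bool Series therefore has to be converted with .values before it is passed to iloc.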
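To make the loc/iloc difference concrete, here is a minimal sketch on a made-up table (the column names and ID values are only illustrative, not the data/table.csv file used later). It shows the inclusive label slice, the exclusive positional slice, and the .values conversion for a bool Series.

```python
import pandas as pd

# Toy table; the values are illustrative only.
df = pd.DataFrame(
    {"Height": [173, 192, 186, 167], "Weight": [63, 73, 82, 81]},
    index=pd.Index([1101, 1102, 1103, 1104], name="ID"),
)

# loc slices by label and includes both endpoints.
by_label = df.loc[1101:1103]

# iloc slices by position and excludes the right endpoint.
by_position = df.iloc[0:2]

# A bool Series is converted to a plain array before it goes into iloc.
mask = df["Weight"] > 70
heavy = df.iloc[mask.values]

print(list(by_label.index), list(by_position.index), list(heavy.index))
```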

(c.1) The [] operator on a Series

(c.2) The [] operator on a DataFrame

2. Bool indexing

3. Fast scalar indexing
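The two headings above have no example in these notes, so here is a small sketch under my own assumptions (a toy table with made-up values): a combined bool condition, followed by at and iat for a single scalar.

```python
import pandas as pd

# Toy table; the values are illustrative only.
df = pd.DataFrame(
    {"Gender": ["M", "F", "M"], "Math": [34.0, 32.5, 87.2]},
    index=pd.Index([1101, 1102, 1103], name="ID"),
)

# Bool indexing: combine conditions with & or |, each wrapped in parentheses.
boys_high = df[(df["Gender"] == "M") & (df["Math"] > 50)]

# at looks up one scalar by labels, iat by integer positions.
by_label = df.at[1102, "Math"]
by_position = df.iat[1, 1]

print(list(boys_high.index), by_label, by_position)
```

at and iat only handle a single cell, which is why they are the quick choice for scalars.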

4. Interval indexing

The interval_range method creates intervals; cut converts values into them.
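A minimal sketch of the two calls working together, using made-up scores rather than the course data: interval_range builds the bins and cut assigns each value to the bin that contains it.

```python
import pandas as pd

# Four right-closed intervals of width 25 covering (0, 100].
bins = pd.interval_range(start=0, end=100, periods=4)

scores = pd.Series([10.0, 25.0, 60.0, 99.5], name="Math")

# cut maps each value to the interval that contains it.
binned = pd.cut(scores, bins=bins)

print(binned)
```

Because the intervals are closed on the right, 25.0 lands in (0, 25] and not in (25, 50].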

II. Multi-level indexing

1. Creating a multi-level index

With from_tuples or from_arrays: write the tuples directly, build them with zip, or create them from an array.

pd.MultiIndex.from_tuples()

Tuples: tuples = [('A','a'),('A','b'),('B','a'),('B','b')]

tuples = list(zip(L1, L2))

arrays = [['A','a'],['A','b'],['B','a'],['B','b']]
mul_index = pd.MultiIndex.from_tuples(arrays, names=('Upper', 'Lower'))
pd.DataFrame({'Score':['perfect','good','fair','bad']}, index=mul_index)

from_product

Use existing df columns as the index: set_index()
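The last two creation routes, from_product and set_index, are only named above. The sketch below shows both on toy data; the table called flat and its values are made up for illustration.

```python
import pandas as pd

# from_product builds every combination of the two level lists.
mul_index = pd.MultiIndex.from_product(
    [["A", "B"], ["a", "b"]], names=("Upper", "Lower")
)
scores = pd.DataFrame({"Score": ["perfect", "good", "fair", "bad"]}, index=mul_index)

# set_index promotes existing columns to index levels.
flat = pd.DataFrame(
    {
        "Class": ["C_1", "C_1", "C_2"],
        "Address": ["street_1", "street_2", "street_5"],
        "Math": [34.0, 32.5, 97.0],
    }
)
nested = flat.set_index(["Class", "Address"])

print(scores)
print(nested)
```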

2. Multi-level slice indexing

df_using_mul.sort_index().loc[('C_2','street_6'):('C_3','street_4')]

3. Slice objects in a multi-level index
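A small sketch of slice objects on a multi-level index, assuming pd.IndexSlice as the way to write them (the notes do not say which form was used). The table is a toy one with made-up values.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["C_1", "C_2", "C_3"], ["street_1", "street_2"]], names=("Class", "Address")
)
df = pd.DataFrame({"Math": [34.0, 32.5, 87.2, 80.4, 84.8, 97.0]}, index=index)
df = df.sort_index()  # label slicing on a MultiIndex needs a sorted index

idx = pd.IndexSlice
# All classes from C_2 onward, restricted to street_2 in the inner level.
subset = df.loc[idx["C_2":, "street_2"], :]

print(subset)
```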

4. Swapping index levels

df_using_mul.swaplevel(i=1,j=0,axis=0).sort_index().head() # swap the two levels

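III. Setting the index

The index_col parameter

index_col is a parameter of read_csv that makes the specified column the index.

reindex and reindex_like

reindex re-indexes a table, aligning its data to the new index; it is mostly used for reordering.

Filling missing values: fill_value and method (bfill fills from the next valid row in the index, ffill from the previous one, nearest from the closer one; the index must be monotonic).

reindex_like: returns a DataFrame whose row and column indices match those of the object passed in, with the data taken from the calling table.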
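Here is a short sketch of the fill options and of reindex_like on a toy table (IDs and scores are made up). The template frame is a hypothetical stand-in for "another object whose labels we want to copy".

```python
import pandas as pd

df = pd.DataFrame({"Math": [34.0, 87.2, 84.8]}, index=[1101, 1103, 1105])
new_index = [1101, 1102, 1103]

# Labels missing from the original index become NaN unless a fill rule is given.
plain = df.reindex(new_index)
constant = df.reindex(new_index, fill_value=0.0)
forward = df.reindex(new_index, method="ffill")   # copy the previous valid row
backward = df.reindex(new_index, method="bfill")  # copy the next valid row

# reindex_like borrows the row and column labels of another object.
template = pd.DataFrame(index=[1105, 1101], columns=["Math"])
like = df.reindex_like(template)

print(forward, backward, like, sep="\n")
```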

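set_index and reset_index

The set_index method: turns the chosen columns into the index.

The reset_index method: resets the index, by default back to a 0, 1, 2, ... integer index. The level parameter selects which level is reset, and col_level selects which column level the restored labels are placed in.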
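A minimal sketch of the round trip, on made-up data, showing the difference between resetting one level and resetting everything:

```python
import pandas as pd

df = pd.DataFrame(
    {"Class": ["C_1", "C_1", "C_2"], "ID": [1101, 1102, 1201], "Math": [34.0, 32.5, 97.0]}
)

nested = df.set_index(["Class", "ID"])     # two columns become index levels
partial = nested.reset_index(level="ID")   # only the ID level returns to a column
restored = nested.reset_index()            # every level returns; index is 0, 1, 2

print(partial)
print(restored)
```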

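rename_axis and rename

The rename_axis method is aimed at multi-level indexes: it changes the name of an index level, not the index labels.

The rename method changes column or row index labels, not the index names.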
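The name/label distinction is easy to mix up, so here is a sketch on a toy multi-level index. The new names "LowerLower", "T", and "Points" are arbitrary choices for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("A", "a"), ("A", "b"), ("B", "a")], names=("Upper", "Lower")
)
df = pd.DataFrame({"Score": [1, 2, 3]}, index=index)

# rename_axis changes the level names and leaves the labels untouched.
renamed_names = df.rename_axis(index={"Lower": "LowerLower"})

# rename changes labels (here in the Upper level and in the columns), not the names.
renamed_labels = df.rename(index={"A": "T"}, level="Upper").rename(
    columns={"Score": "Points"}
)

print(renamed_names.index.names, list(renamed_labels.index))
```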

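IV. Common index-based functions

The where function

Fills the cells where the condition is False (with NaN by default):

The mask function

Fills the cells where the condition is True.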
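where and mask are mirror images of each other. A tiny sketch on made-up data shows the same condition applied both ways, plus the optional replacement value:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "M"], "Math": [34.0, 32.5, 87.2]})
is_male = df["Gender"] == "M"

kept = df.where(is_male)    # rows where the condition is False become NaN
hidden = df.mask(is_male)   # rows where the condition is True become NaN

# A second argument replaces the affected cells instead of leaving NaN.
filled = df["Math"].where(is_male, other=0.0)

print(kept, hidden, filled, sep="\n")
```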

The query function

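# df.where(df['Gender']=='M').head() # rows that fail the condition are set entirely to NaN
df.where(df['Gender']=='M').dropna().head() # dropna removes the rows containing NaN
## similar to the [] operation

df.where(df['Gender']=='M', np.random.rand(df.shape[0], df.shape[1])).head() # fill the would-be NaN cells with random values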

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173.000000	63.000000	34.000000	A+
1102	0.951112	0.799791	0.0307497	0.128599	0.104629	0.187326	0.869920	0.435778
1103	S_1	C_1	M	street_2	186.000000	82.000000	87.200000	B+
1104	0.90324	0.947564	0.34649	0.0282892	0.059027	0.468834	0.536150	0.679534
1105	0.324071	0.309613	0.301033	0.0838299	0.732232	0.673442	0.161185	0.496035

df.query("(Address in ['street_6','street_7']) & (Weight>(70+10)) & (ID in [1303,2304,2402])")
# query expressions may use row/column index names, strings, and/not/or/&/|/~/not in/in/==/!=, and arithmetic operators
# a bit like an SQL query

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1303	S_1	C_3	M	street_7	188	82	49.7	B
2304	S_2	C_3	F	street_6	164	81	95.5	A-
2402	S_2	C_4	M	street_7	166	82	48.7	B

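V. Handling duplicate elements

The duplicated method

Returns a bool Series that marks whether each row is a duplicate.

The drop_duplicates method

Drops duplicates, for example when the first row of each group should be kept.

df.duplicated('Class').head()
# keep defaults to 'first': the first occurrence is marked as not duplicated;
# 'last' marks the last occurrence as not duplicated; False marks every duplicated item True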

ID
1101    False
1102     True
1103     True
1104     True
1105     True
dtype: bool

# df.drop_duplicates('Class', keep='last')
df.drop_duplicates(['Class', 'School'])

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1201	S_1	C_2	M	street_5	188	68	97.0	A-
1301	S_1	C_3	M	street_4	161	68	31.5	B+
2101	S_2	C_1	M	street_7	174	84	83.3	C
2201	S_2	C_2	M	street_5	193	100	39.1	B
2301	S_2	C_3	F	street_4	157	78	72.3	B+
2401	S_2	C_4	F	street_2	192	62	45.3	A
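To compare the three keep settings side by side, here is a sketch on a four-row toy table (values made up):

```python
import pandas as pd

df = pd.DataFrame(
    {"School": ["S_1", "S_1", "S_2", "S_2"], "Class": ["C_1", "C_1", "C_1", "C_2"]}
)

first = df.duplicated("Class")               # keep="first": first occurrence is not flagged
last = df.duplicated("Class", keep="last")   # last occurrence is not flagged
every = df.duplicated("Class", keep=False)   # every member of a repeated group is flagged

# With several columns, a row is a duplicate only if the whole combination repeats.
unique_pairs = df.drop_duplicates(["Class", "School"])

print(first.tolist(), last.tolist(), every.tolist(), list(unique_pairs.index))
```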

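VI. The sampling function sample

n is the sample size, frac is the sampling fraction, and replace controls whether sampling is done with replacement.

axis is the dimension to sample along, 0 by default.

weights gives the sampling weights; pandas normalizes them automatically if they do not sum to 1.

# df.sample(n=5,)
df.sample(frac=0.05)
df.sample(n=df.shape[0], replace=True).head() # with replacement, rows may appear more than once
df.sample(n=df.shape[0], replace=True).index.is_unique # False if any row was drawn more than once

df.sample(n=3, axis=1).head() # axis=1 samples columns, not rows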

	School	Class	Weight
ID
1101	S_1	C_1	63
1102	S_1	C_1	73
1103	S_1	C_1	82
1104	S_1	C_1	81
1105	S_1	C_1	64
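On the weights question: a small sketch with made-up data and weights that do not sum to 1. A zero weight means the row can never be drawn, which makes the result predictable enough to check.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["a", "b", "c"], "Weight": [63, 73, 82]})

# Weights 1, 0, 3 are rescaled by pandas to sum to 1; the zero-weight row is never picked.
picked = df.sample(n=2, weights=[1, 0, 3], random_state=0)

# axis=1 samples columns rather than rows.
one_column = df.sample(n=1, axis=1, random_state=0)

print(sorted(picked.index), one_column.shape)
```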

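1. Questions

[Question 2] If you want to select a subset of a DataFrame, give as many ways of doing it as you can.

[Question 3] Is the query function slower than the other indexing methods? Which kind of indexing is most efficient in which situation?

[Question 4] Can a single-level index use Slice objects? If so, how? Give an example.

[Question 5] How can you quickly find the index positions of the missing values in a column?

[Question 6] Which situations does each of the index-setting methods suit? How do you directly replace a DataFrame's index with an arbitrary given index of the same length?

[Question 7] In which situations is a multi-level index appropriate?

[Question 8] With a multi-level index, how do you filter on a condition in an inner level?

[Question 9] When is duplicate handling needed?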

df = pd.read_csv('data/table.csv')
df.head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+

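#### [Question 1] How do you change the order of columns or rows? How do you swap odd and even rows (columns)?
# Answer (this reverses the order of both the rows and the columns):
df.loc[::-1, ::-1].head()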

	Physics	Math	Weight	Height	Address	Gender	ID	Class	School
34	B	47.6	54	193	street_6	F	2405	C_4	S_2
33	B	67.7	84	160	street_2	F	2404	C_4	S_2
32	B+	59.7	60	158	street_6	F	2403	C_4	S_2
31	B	48.7	82	166	street_7	M	2402	C_4	S_2
30	A	45.3	62	192	street_2	F	2401	C_4	S_2

#### Swap specified rows
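This part was left blank in the notes. One possible approach, sketched on a toy table and not taken from the course material, is to permute an array of integer positions and hand it to iloc. The same idea covers swapping two chosen rows and swapping each even/odd pair.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [10, 11, 12, 13, 14]}, index=list("abcde"))

# Swap two chosen rows (positions 0 and 3) by permuting integer positions.
order = np.arange(len(df))
order[[0, 3]] = order[[3, 0]]
swapped = df.iloc[order]

# Swap each even-positioned row with the row right after it.
# With an odd number of rows, the last row has no partner and stays in place.
pairs = np.arange(len(df))
even = np.arange(0, len(df) - 1, 2)
pairs[even] = even + 1
pairs[even + 1] = even
odd_even = df.iloc[pairs]

print(list(swapped.index), list(odd_even.index))
```

For columns the same position arrays can be used as df.iloc[:, order].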

2. Exercises

[Exercise 1] Given a dataset on UFO sightings, answer the following questions:

data = pd.read_csv('data/UFO.csv')
data.head()

	datetime	shape	duration (seconds)	latitude	longitude
0	10/10/1949 20:30	cylinder	2700.0	29.883056	-97.941111
1	10/10/1949 21:00	light	7200.0	29.384210	-98.581082
2	10/10/1955 17:00	circle	20.0	53.200000	-2.916667
3	10/10/1956 21:00	circle	20.0	28.978333	-96.645833
4	10/10/1960 20:00	light	900.0	21.418056	-157.803611

#### (a) Among all sightings observed for more than 60 s, which shape is the most common?

data[data["duration (seconds)"]>60]["shape"].value_counts().index[0]

'light'

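#### (b) Partition longitude and latitude: split -180° to 180° into longitude bands of 30° each,
# and -90° to 90° into latitude bands of 18° each. Which region has the most reported UFO events?

# data_muls = data.set_index(['longitude', 'latitude'])
# Longitude, -180° to 180°. Note: periods=30 yields 30 bins of width 12°;
# bins of width 30° would need freq=30 (equivalently periods=12).
i_longitude = pd.interval_range(start=-180, end=180, periods=30)
i_longitude
# Latitude, -90° to 90°. Note: periods=18 yields 18 bins of width 10°;
# bins of width 18° would need freq=18 (equivalently periods=10).
i_latitude = pd.interval_range(start=-90, end=90, periods=18)
i_latitude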

IntervalIndex([(-90, -80], (-80, -70], (-70, -60], (-60, -50], (-50, -40] ... (40, 50], (50, 60], (60, 70], (70, 80], (80, 90]],
              closed='right',
              dtype='interval[int64]')

long = pd.cut(data['longitude'], bins=i_longitude)
long # longitude converted to intervals

0         (-108, -96]
1         (-108, -96]
2            (-12, 0]
3         (-108, -96]
4        (-168, -156]
             ...     
80327      (-96, -84]
80328    (-120, -108]
80329    (-132, -120]
80330      (-84, -72]
80331     (-108, -96]
Name: longitude, Length: 80332, dtype: category
Categories (30, interval[int64]): [(-180, -168] < (-168, -156] < (-156, -144] < (-144, -132] ... (132, 144] < (144, 156] < (156, 168] < (168, 180]]

lat = pd.cut(data['latitude'], bins=i_latitude)
lat # latitude converted to intervals

0        (20, 30]
1        (20, 30]
2        (50, 60]
3        (20, 30]
4        (20, 30]
           ...   
80327    (30, 40]
80328    (40, 50]
80329    (30, 40]
80330    (30, 40]
80331    (30, 40]
Name: latitude, Length: 80332, dtype: category
Categories (18, interval[int64]): [(-90, -80] < (-80, -70] < (-70, -60] < (-60, -50] ... (50, 60] < (60, 70] < (70, 80] < (80, 90]]

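# Interval index selection. I did not really understand this step and copied the approach from above; I will check the group discussion later.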
df_i = data.join(long,rsuffix='_interval').join(lat,rsuffix='_interval').reset_index()
df_i.head()

	index	datetime	shape	duration (seconds)	latitude	longitude	longitude_interval	latitude_interval
0	0	10/10/1949 20:30	cylinder	2700.0	29.883056	-97.941111	(-108, -96]	(20, 30]
1	1	10/10/1949 21:00	light	7200.0	29.384210	-98.581082	(-108, -96]	(20, 30]
2	2	10/10/1955 17:00	circle	20.0	53.200000	-2.916667	(-12, 0]	(50, 60]
3	3	10/10/1956 21:00	circle	20.0	28.978333	-96.645833	(-108, -96]	(20, 30]
4	4	10/10/1960 20:00	light	900.0	21.418056	-157.803611	(-168, -156]	(20, 30]

# I wanted to wrap the two columns together with zip, but did not get it right
# The approach learned today (set the index, then count) is worth a try
# df_i[list(zip(df_i['longitude_interval'], df_i['latitude_interval']))]

df_i2 = df_i.set_index(['longitude_interval','latitude_interval']).sort_index()
df_i2.head()
df_i2.index.value_counts()
# i.e. the most events fall in longitude (-84, -72], latitude (40, 50]

((-84, -72], (40, 50])        12043
((-120, -108], (30, 40])       9594
((-96, -84], (30, 40])         9055
((-84, -72], (30, 40])         8681
((-96, -84], (40, 50])         7348
                              ...  
((144, 156], (-10, 0])            1
((-84, -72], (60, 70])            1
((120, 132], (-10, 0])            1
((-180, -168], (-20, -10])        1
((12, 24], (20, 30])              1
Length: 173, dtype: int64

[Exercise 2] Given a dataset on Pokemon, answer the following questions:

data = pd.read_csv('data/Pokemon.csv')
data.head()

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	1	False
3	3	VenusaurMega Venusaur	Grass	Poison	625	80	100	123	122	120	80	1	False
4	4	Charmander	Fire	NaN	309	39	52	43	60	50	65	1	False

# (a) What proportion of all Pokemon have two types?
n = data.shape[0] # number of rows
# data[data['Type 1'].isnull() & data['Type 2'].isnull()]
data[data['Type 1'].isnull() | data['Type 2'].isnull()].count() # rows missing a type, i.e. single-type Pokemon

#             386
Name          386
Type 1        386
Type 2          0
Total         386
HP            386
Attack        386
Defense       386
Sp. Atk       386
Sp. Def       386
Speed         386
Generation    386
Legendary     386
dtype: int64

print("Dual-type proportion: %f" % ((n-386)/n))

Dual-type proportion: 0.517500

# (b) Among all Pokemon whose base stat total (Total) is at least 580, what proportion are non-legendary (Legendary=False)?

D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  This is separate from the ipykernel package so we can avoid doing imports until
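The code for (b) is not shown above, only its warning. That warning appears when a bool Series whose index does not match the table is used as a key, which typically happens when a mask built from the full table is applied to an already filtered one. A sketch that avoids it, on a made-up stand-in for the Pokemon table (so the number is not the answer to the exercise):

```python
import pandas as pd

# Toy stand-in for the Pokemon table; the values are made up.
data = pd.DataFrame(
    {"Total": [600, 580, 700, 400], "Legendary": [False, False, True, False]}
)

# Filter once, then work only with columns of the filtered frame.
strong = data[data["Total"] >= 580]

# The mean of a bool column is the share of True values.
non_legendary_share = (~strong["Legendary"]).mean()

print(non_legendary_share)
```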

# (c) Among Pokemon whose first type is Fighting, which three have the highest Attack?
data[data['Type 1']=='Fighting'].sort_values('Attack',ascending=False).head(3)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
498	448	LucarioMega Lucario	Fighting	Steel	625	70	145	88	140	70	112	4	False
594	534	Conkeldurr	Fighting	NaN	505	105	140	95	55	65	45	5	False
74	68	Machamp	Fighting	NaN	505	90	130	80	65	85	55	1	False

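# (d) For which type is the mean of the range (max minus min) of the six base stats (HP, Attack, Sp. Atk, Defense, Sp. Def, Speed) the largest (consider only the first type; the mean is taken per type)?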
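No answer is recorded for (d). The sketch below shows one way to read the question, assuming "range" means the per-Pokemon maximum minus minimum across the six stat columns, averaged within each first type. The three rows are a toy sample, so the result is not the answer for the full dataset.

```python
import pandas as pd

# Toy sample with the same column names as the Pokemon table.
data = pd.DataFrame(
    {
        "Type 1": ["Grass", "Grass", "Fire"],
        "HP": [45, 60, 39],
        "Attack": [49, 62, 52],
        "Defense": [49, 63, 43],
        "Sp. Atk": [65, 80, 60],
        "Sp. Def": [65, 80, 50],
        "Speed": [45, 60, 65],
    }
)
stats = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Range of the six stats for each Pokemon (row-wise max minus min).
spread = data[stats].max(axis=1) - data[stats].min(axis=1)

# Average that range within each first type, then take the largest.
mean_spread = spread.groupby(data["Type 1"]).mean()
widest_type = mean_spread.idxmax()

print(mean_spread, widest_type)
```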

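# (e) For which type (first type only) do legendaries make up the highest proportion of all Pokemon? Do that type's legendaries also have the highest base stat total?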
data['Type 1'].value_counts().head()

Water      112
Normal      98
Grass       70
Bug         69
Psychic     57
Name: Type 1, dtype: int64

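Water is the most common first type (112 Pokemon). Note that this count covers all Pokemon rather than only legendaries, so on its own it does not show which type has the highest share of legendaries.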

# data.sort_values('Total',ascending=False).head(10)
data['Total'].unique().max()

# Look at the Total values for first type Water, in descending order
data[data['Type 1']=='Water'].sort_values('Total', ascending=False).head(10)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
422	382	KyogrePrimal Kyogre	Water	NaN	770	100	150	90	180	160	90	3	True
541	484	Palkia	Water	Dragon	680	90	120	100	150	120	100	4	True
421	382	Kyogre	Water	NaN	670	100	100	90	150	140	90	3	True
141	130	GyaradosMega Gyarados	Water	Dark	640	95	155	109	70	130	81	1	False
283	260	SwampertMega Swampert	Water	Ground	635	100	150	110	95	110	70	3	False
12	9	BlastoiseMega Blastoise	Water	NaN	630	79	103	120	135	115	78	1	False
548	490	Manaphy	Water	NaN	600	100	100	100	100	100	100	4	False
87	80	SlowbroMega Slowbro	Water	Psychic	590	95	75	180	130	80	30	1	False
714	647	KeldeoResolute Forme	Water	Fighting	580	91	72	90	129	90	108	5	False
713	647	KeldeoOrdinary Forme	Water	Fighting	580	91	72	90	129	90	108	5	False

For first type Water the highest Total is 770, whereas the highest Total among all Pokemon is 780.
