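1. Using eval
eval() evaluates a string as a Python expression and returns the result. In other words, a string that contains a formula or a list literal becomes, once eval() has processed it, a value or list that Python can work with directly.
Syntax: eval(expression, globals=None, locals=None), which returns the result of the evaluation.
Where:
expression is the Python expression to evaluate, passed as a string
globals is an optional argument; if it is not None, it must be a dictionary
locals is an optional argument; if it is provided, it can be any mapping object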
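A minimal sketch of the behaviour described above. The variable names and the `scope` dictionary are made up for illustration, and `ast.literal_eval` is shown only as a commonly recommended alternative when the string may come from an untrusted source.

```python
import ast

# eval() evaluates a string as a Python expression and returns the result.
total = eval("1 + 2 * 3")

# A string that looks like a list literal becomes a real list object.
numbers = eval("[1, 2, 3]")

# globals must be a dict; names in the expression are looked up in it.
scope = {"x": 10, "y": 4}  # hypothetical names for this example
difference = eval("x - y", scope)

# ast.literal_eval accepts only literals, so it cannot run arbitrary code.
safe_numbers = ast.literal_eval("[1, 2, 3]")

print(total, numbers, difference, safe_numbers)
```

Because eval() runs whatever expression it is given, it is best reserved for strings you control.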
2. Iterating directly over the tuples in a list with a for loop
Example:
dup_amount_proportion = [('eighty', 0.8)]  # a list whose only element is a tuple
for name, proportion in dup_amount_proportion:
    print(name, proportion)
eighty 0.8
type(('eighty', 0.8))
<class 'tuple'>
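The same unpacking works when the list holds several tuples, and it can be combined with enumerate(). The extra tuples below are invented for illustration and are not from the original data.

```python
# Hypothetical data: each element is a (name, proportion) tuple.
dup_amount_proportion = [("eighty", 0.8), ("fifty", 0.5), ("twenty", 0.2)]

labels = []
# The parentheses around (name, proportion) unpack each tuple,
# while enumerate() supplies the position in the list.
for index, (name, proportion) in enumerate(dup_amount_proportion):
    labels.append(f"{index}: {name} -> {proportion:.0%}")

print(labels)
```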
3. Finding duplicates in pandas
Example:
import pandas as pd
data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])
print(frame)
Result:
   key1  key2  data
0     1     2     5
1     2     2     6
2     3     1     2
3     1     2     6
4     2     2     1
5     3     4     6
6     2     2     2
7     2     2     8
# dataframe.duplicated(['column1', 'column2']) marks every duplicated row except the first occurrence; indexing with that mask returns those rows
frame[frame.duplicated(['key1','key2'])]
result:
   key1  key2  data
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8
Official documentation for duplicated:
DataFrame.duplicated(subset=None, keep='first')
Return boolean Series denoting duplicate rows, optionally only considering certain columns. In other words, it flags duplicate rows, and the check can be restricted to a subset of columns.
Parameters:
subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns
keep : {'first', 'last', False}, default 'first'. Controls which occurrences are marked as duplicates.
first : Mark duplicates as True except for the first occurrence. Used as a mask, this selects every duplicate row except the first one in each set.
last : Mark duplicates as True except for the last occurrence. Used as a mask, this selects every duplicate row except the last one in each set.
False : Mark all duplicates as True. Used as a mask, this selects every row that has a duplicate.
Returns:
duplicated : Series
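To make the three keep options concrete, the sketch below reuses the frame from the example above and compares which row labels each option marks. The variable names (`first_rows`, `last_rows`, `all_rows`) are chosen here for illustration.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])
subset = ['key1', 'key2']

# keep='first': every repeat after the first occurrence is marked True.
first_rows = frame[frame.duplicated(subset, keep='first')].index.tolist()

# keep='last': every repeat before the last occurrence is marked True.
last_rows = frame[frame.duplicated(subset, keep='last')].index.tolist()

# keep=False: every row that belongs to a duplicated set is marked True.
all_rows = frame[frame.duplicated(subset, keep=False)].index.tolist()

print(first_rows)  # [3, 4, 6, 7]
print(last_rows)   # [0, 1, 4, 6]
print(all_rows)    # [0, 1, 3, 4, 6, 7]
```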
4. Using select_dtypes to return columns of specific dtypes from a DataFrame
Official documentation:
DataFrame.select_dtypes(include=None, exclude=None)
Return a subset of the DataFrame's columns based on the column dtypes.
Parameters:
include, exclude : scalar or list-like (these are the two main arguments)
A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.
Returns:
subset : DataFrame (a sub-DataFrame of the original DataFrame)
The subset of the frame including the dtypes in include and excluding the dtypes in exclude.
Raises:
ValueError
If both of include and exclude are empty
If include and exclude have overlapping elements
If any kind of string dtype is passed in.
Example:
>>> df = pd.DataFrame({'a': [1, 2] * 3,
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3})
>>> df
a b c
0 1 True 1.0
1 2 False 2.0
2 1 True 1.0
3 2 False 2.0
4 1 True 1.0
5 2 False 2.0
>>> df.select_dtypes(include='bool')
b
0 True
1 False
2 True
3 False
4 True
5 False
>>> df.select_dtypes(include=['float64'])
c
0 1.0
1 2.0
2 1.0
3 2.0
4 1.0
5 2.0
>>> df.select_dtypes(exclude=['int64'])
b c
0 True 1.0
1 False 2.0
2 True 1.0
3 False 2.0
4 True 1.0
5 False 2.0
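As a further sketch, include and exclude can be combined, and the generic string 'number' selects both integer and float columns but not booleans. The DataFrame is the same as in the example above; the result names are invented here.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2] * 3,
                   'b': [True, False] * 3,
                   'c': [1.0, 2.0] * 3})

# 'number' matches integer and float columns, but not bool.
numeric = df.select_dtypes(include='number')

# include and exclude together: numeric or bool columns, minus float64.
non_float = df.select_dtypes(include=['number', 'bool'], exclude=['float64'])

print(list(numeric.columns))    # ['a', 'c']
print(list(non_float.columns))  # ['a', 'b']
```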
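5. Using groupby in pandas
Any groupby operation involves one or more of the following three steps:
Splitting: split the data into groups
Applying: apply a function to each group
Combining: combine the results
In many situations, we split the data into groups and apply some function to each subset. In the apply step, we can perform the following operations:
Aggregation: compute a summary statistic for each group
Transformation: perform a group-specific operation
Filtration: discard data based on some condition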
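The three kinds of operation listed above can be sketched on a small, made-up subset of the team data. The lambdas are illustrative choices, not the only way to express these steps.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings'],
                   'Points': [876, 789, 863, 673, 741]})
grouped = df.groupby('Team')

# Aggregation: one summary value per group.
mean_points = grouped['Points'].mean()

# Transformation: the result has one value per original row,
# here each row's distance from its own group's mean.
centered = grouped['Points'].transform(lambda s: s - s.mean())

# Filtration: drop whole groups that fail a condition,
# here teams with fewer than two rows.
at_least_two = grouped.filter(lambda g: len(g) >= 2)

print(mean_points)
print(centered.tolist())
print(at_least_two)
```

Aggregation returns one row per group, transformation keeps the original shape, and filtration returns a subset of the original rows.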
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
print(df)
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690
A pandas object can be split into groups in several ways:
obj.groupby('key')
obj.groupby(['key1', 'key2'])
obj.groupby(key, axis=1)  # note: the axis argument is deprecated in recent pandas releases
df.groupby('Team')
# This returns a DataFrameGroupBy object
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001B33FFA0DA0>
# View the groups
df.groupby('Team').groups
{'Devils': Int64Index([2, 3], dtype='int64'),
'Kings': Int64Index([4, 6, 7], dtype='int64'),
'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
'Royals': Int64Index([9, 10], dtype='int64'),
'kings': Int64Index([5], dtype='int64')}
Grouping by multiple columns
df.groupby(['Team','Year']).groups
{('Devils', 2014): Int64Index([2], dtype='int64'),
('Devils', 2015): Int64Index([3], dtype='int64'),
('Kings', 2014): Int64Index([4], dtype='int64'),
('Kings', 2016): Int64Index([6], dtype='int64'),
('Kings', 2017): Int64Index([7], dtype='int64'),
('Riders', 2014): Int64Index([0], dtype='int64'),
('Riders', 2015): Int64Index([1], dtype='int64'),
('Riders', 2016): Int64Index([8], dtype='int64'),
('Riders', 2017): Int64Index([11], dtype='int64'),
('Royals', 2014): Int64Index([9], dtype='int64'),
('Royals', 2015): Int64Index([10], dtype='int64'),
('kings', 2015): Int64Index([5], dtype='int64')}
Iterating over the groups
grouped = df.groupby('Team')
for name, group in grouped:
    print(name)
    print(group)
Devils
Team Rank Year Points
2 Devils 2 2014 863
3 Devils 3 2015 673
Kings
Team Rank Year Points
4 Kings 3 2014 741
6 Kings 1 2016 756
7 Kings 1 2017 788
Riders
Team Rank Year Points
0 Riders 1 2014 876
1 Riders 2 2015 789
8 Riders 2 2016 694
11 Riders 2 2017 690
Royals
Team Rank Year Points
9 Royals 4 2014 701
10 Royals 1 2015 804
kings
Team Rank Year Points
5 kings 4 2015 812
Selecting a single group with get_group()
grouped = df.groupby('Year')
print(grouped.get_group(2014))
     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701
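When the grouping uses several columns, get_group() takes a tuple with one value per key, and asking for a group that does not exist raises KeyError. The sketch below uses a small, made-up subset of the data to show both cases.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Devils', 'Kings', 'Riders'],
                   'Year': [2014, 2014, 2014, 2015],
                   'Points': [876, 863, 741, 789]})

by_team_year = df.groupby(['Team', 'Year'])

# With two grouping keys, the group name is a (Team, Year) tuple.
riders_2014 = by_team_year.get_group(('Riders', 2014))
print(riders_2014)

# A key that is not present in the data raises KeyError.
try:
    by_team_year.get_group(('Royals', 2014))
    found = True
except KeyError:
    found = False
print(found)
```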