3 Common pandas methods and functions

weixin_44360866

Last modified 2022-10-05 00:23:55

Category: pandas notes. Tags: pandas, python, data analysis

First published 2022-07-22 14:43:19

Article link: https://blog.csdn.net/weixin_44360866/article/details/125931782

1. df.head() shows the first 5 rows by default; df.tail() shows the last 5 rows by default

import pandas as pd

df = pd.read_csv(r'data\table.csv')
print(df.head())
print(df.tail())

  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+
   School Class    ID Gender   Address  Height  Weight  Math Physics
30    S_2   C_4  2401      F  street_2     192      62  45.3       A
31    S_2   C_4  2402      M  street_7     166      82  48.7       B
32    S_2   C_4  2403      F  street_6     158      60  59.7      B+
33    S_2   C_4  2404      F  street_2     160      84  67.7       B
34    S_2   C_4  2405      F  street_6     193      54  47.6       B

You can choose how many rows to display by passing a number.

print(df.head(10))
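
As a small sketch of how the row count argument behaves, the block below uses a made-up six-row DataFrame (the name toy and its values are illustrative assumptions, not the table.csv used in this post). It also shows that a negative n keeps everything except the last rows.

```python
import pandas as pd

# Hypothetical toy data, not the table.csv used in this post
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Math": [34.0, 32.5, 87.2, 80.4, 84.8, 97.0],
})

first_two = toy.head(2)       # rows with index 0 and 1
last_two = toy.tail(2)        # rows with index 4 and 5
all_but_last = toy.head(-1)   # negative n: every row except the last one

print(first_two)
print(last_two)
print(len(all_but_last))
```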

2. df[column].unique() finds the distinct values in a column and returns them as an ndarray; df[column].nunique() returns the number of distinct values

print(df['School'].unique(), df['Address'].unique())
print(df['School'].nunique(), df['Address'].nunique())

['S_1' 'S_2'] ['street_1' 'street_2' 'street_4' 'street_5' 'street_6' 'street_7']
2 6
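
One detail worth checking is how missing values are treated. The sketch below uses a made-up Series (not the post's data) to show that unique keeps NaN as a value, while nunique ignores it unless dropna=False is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
s = pd.Series(["S_1", "S_2", "S_1", np.nan])

uniques = s.unique()                   # NaN appears among the distinct values
n_default = s.nunique()                # missing values are not counted by default
n_with_nan = s.nunique(dropna=False)   # count the missing value as well

print(uniques)
print(n_default, n_with_nan)
```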

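3. count returns the number of non-missing elements in a column; value_counts returns how many times each value occurs in a column

Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
By default value_counts sorts the result by count in descending order, largest first.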

print(df['Height'].count(), df['Physics'].count())
print(df['Physics'].value_counts())

35 35
B+    9
B     8
B-    6
A     4
A+    3
A-    3
C     2
Name: Physics, dtype: int64
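
The normalize and dropna parameters in the signature above can be illustrated with a short sketch. The grades Series is made up for the example and contains one missing value.

```python
import pandas as pd

# Hypothetical grades with one missing entry
grades = pd.Series(["B+", "B", "B+", "A", None, "B"])

non_missing = grades.count()                       # missing entries are not counted
counts = grades.value_counts()                     # occurrences of each grade
shares = grades.value_counts(normalize=True)       # fractions of the non-missing values
with_missing = grades.value_counts(dropna=False)   # also counts the missing entry

print(non_missing)
print(counts)
print(shares)
print(with_missing)
```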

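4. df.describe() summarizes the numeric columns by default; df.info() lists the columns, the non-null count of each column, and each column's dtype

The percentiles reported by describe can be customized.

print(df.describe(), '\n')
# The percentile rows of the output can be chosen with the percentiles argument
print(df.describe(percentiles=[.05, .25, .75, .95]), '\n')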

               ID      Height      Weight       Math
count    35.00000   35.000000   35.000000  35.000000
mean   1803.00000  174.142857   74.657143  61.351429
std     536.87741   13.541098   12.895377  19.915164
min    1101.00000  155.000000   53.000000  31.500000
25%    1204.50000  161.000000   63.000000  47.400000
50%    2103.00000  173.000000   74.000000  61.700000
75%    2301.50000  187.500000   82.000000  77.100000
max    2405.00000  195.000000  100.000000  97.000000 

               ID      Height      Weight       Math
count    35.00000   35.000000   35.000000  35.000000
mean   1803.00000  174.142857   74.657143  61.351429
std     536.87741   13.541098   12.895377  19.915164
min    1101.00000  155.000000   53.000000  31.500000
5%     1102.70000  157.000000   56.100000  32.640000
25%    1204.50000  161.000000   63.000000  47.400000
50%    2103.00000  173.000000   74.000000  61.700000
75%    2301.50000  187.500000   82.000000  77.100000
95%    2403.30000  193.300000   97.600000  90.040000
max    2405.00000  195.000000  100.000000  97.000000

describe on non-numeric data

# describe applied to a non-numeric column
print(df['Physics'].describe(), '\n')

count     35
unique     7
top       B+
freq       9
Name: Physics, dtype: object
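
To get the numeric and non-numeric summaries in one table, describe accepts include='all'. The sketch below assumes a made-up four-row DataFrame; statistics that do not apply to a column show up as NaN.

```python
import pandas as pd

# Hypothetical mixed-type data
toy = pd.DataFrame({
    "Math": [34.0, 32.5, 87.2, 80.4],
    "Physics": ["A+", "B+", "B+", "B-"],
})

# include='all' reports count/unique/top/freq and mean/std/quantiles together
summary = toy.describe(include="all")
print(summary)
```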

info()

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   School   35 non-null     object 
 1   Class    35 non-null     object 
 2   ID       35 non-null     int64  
 3   Gender   35 non-null     object 
 4   Address  35 non-null     object 
 5   Height   35 non-null     int64  
 6   Weight   35 non-null     int64  
 7   Math     35 non-null     float64
 8   Physics  35 non-null     object 
dtypes: float64(1), int64(3), object(5)
memory usage: 2.6+ KB
None

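5. idxmax returns the index label of the maximum value (idxmin does the same for the minimum); nlargest returns the n largest values together with their index labels (nsmallest returns the n smallest)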

print(df['Math'].idxmax())
print(df['Math'].idxmin(), '\n')
print(df['Math'].nlargest(3))
print(df['Math'].nsmallest(3))

5
10 

5     97.0
28    95.5
11    87.7
Name: Math, dtype: float64
10    31.5
1     32.5
26    32.7
Name: Math, dtype: float64
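
Because idxmax returns an index label, it can be passed to loc to fetch the whole row, and nlargest also exists on DataFrames. The following sketch assumes a small made-up table with string labels.

```python
import pandas as pd

# Hypothetical data with string index labels
toy = pd.DataFrame(
    {"ID": [1101, 1102, 1103, 1104], "Math": [34.0, 32.5, 87.2, 80.4]},
    index=["a", "b", "c", "d"],
)

best_label = toy["Math"].idxmax()   # label of the row with the highest Math score
best_row = toy.loc[best_label]      # the full row for that label
top2 = toy.nlargest(2, "Math")      # the two rows with the largest Math values

print(best_label)
print(best_row)
print(top2)
```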

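6. clip truncates values that fall above or below given thresholds; replace substitutes values

Note: clip accepts an inplace parameter.

print(df['Math'].head())
print(df['Math'].clip(33, 80).head(8), '\n')  # values below 33 become 33 and values above 80 become 80; 33 and 80 themselves are unchanged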

0    34.0
1    32.5
2    87.2
3    80.4
4    84.8
Name: Math, dtype: float64
0    34.0
1    33.0
2    80.0
3    80.0
4    80.0
5    80.0
6    63.5
7    58.8
Name: Math, dtype: float64
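
clip can also be given only one bound. This sketch, on a made-up Series, shows two-sided clipping next to lower-only and upper-only clipping, and that the original Series is left unchanged when inplace is not set.

```python
import pandas as pd

# Hypothetical scores, including values exactly on the bounds
s = pd.Series([32.5, 33.0, 50.0, 80.0, 87.2])

both = s.clip(33, 80)          # limit to the closed interval [33, 80]
floor_only = s.clip(lower=33)  # only raise values below 33
cap_only = s.clip(upper=80)    # only lower values above 80

print(both.tolist())
print(floor_only.tolist())
print(cap_only.tolist())
```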

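replace also accepts an inplace parameter.
1. Select a column first and replace its values using two lists (old values, new values). 2. Skip the column selection and apply a nested dict directly to the df.

# Select one column, then replace using lists
print(df['Address'].replace(['street_1', 'street_2'], ['one', 'two']).head(5))
# Without selecting a column first, replace directly on the df using a dict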
print(df.replace({'Address': {'street_1': 'ONE', 'street_2': 'TWO'}}).head())

0         one
1         two
2         two
3         two
4    street_4
Name: Address, dtype: object
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M       ONE     173      63  34.0      A+
1    S_1   C_1  1102      F       TWO     192      73  32.5      B+
2    S_1   C_1  1103      M       TWO     186      82  87.2      B+
3    S_1   C_1  1104      F       TWO     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+
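
Two other forms of replace may be useful: a flat dict on a Series, and pattern-based replacement with regex=True. The sketch below uses a made-up Series of addresses.

```python
import pandas as pd

# Hypothetical address values
addr = pd.Series(["street_1", "street_2", "street_4"])

# A flat dict maps old values to new ones; unmatched values are kept
mapped = addr.replace({"street_1": "one", "street_2": "two"})

# With regex=True the first argument is treated as a regular expression
renamed = addr.replace(r"^street_", "st-", regex=True)

print(mapped.tolist())
print(renamed.tolist())
```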

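7. MAD: mean absolute deviation, the average absolute distance of the values from their mean

# Series.mad() was deprecated in pandas 1.5 and removed in 2.0; this is the equivalent computation
print((df['Weight'] - df['Weight'].mean()).abs().mean())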

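8. apply runs a function on each value of a Series, or on each column of a DataFrame ★

The argument is a function or other callable.

# Apply to a Series: the function receives each value
print(df['Math'].apply(lambda x: str(x) + '!').head(), '\n')
# Apply to a DataFrame: with the default axis=0 the function receives each column
print(df.apply(lambda col: col.apply(lambda v: str(v) + '~')).head())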

0    34.0!
1    32.5!
2    87.2!
3    80.4!
4    84.8!
Name: Math, dtype: object 

  School Class     ID Gender    Address Height Weight   Math Physics
0   S_1~  C_1~  1101~     M~  street_1~   173~    63~  34.0~     A+~
1   S_1~  C_1~  1102~     F~  street_2~   192~    73~  32.5~     B+~
2   S_1~  C_1~  1103~     M~  street_2~   186~    82~  87.2~     B+~
3   S_1~  C_1~  1104~     F~  street_2~   167~    81~  80.4~     B-~
4   S_1~  C_1~  1105~     F~  street_4~   159~    64~  84.8~     B+~
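
For comparison, here is a sketch of the two axis settings on a made-up DataFrame. With axis=0 the function is called once per column; with axis=1 it is called once per row. The weight-to-height ratio is only an illustrative computation.

```python
import pandas as pd

# Hypothetical data
toy = pd.DataFrame({"Height": [173, 192], "Weight": [63, 73]})

# axis=0 (default): the function receives each column as a Series
col_max = toy.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series indexed by column name
ratio = toy.apply(lambda row: row["Weight"] / row["Height"], axis=1)

print(col_max)
print(ratio)
```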

Other commonly used functions include

sum/mean/median/mad (removed in pandas 2.0)/min/max/abs/std/var/quantile/cummax/cumsum/cumprod

dropna()

Drops the rows that contain NaN. By default a row is dropped if any of its values is missing.
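
The how and subset parameters change which rows are dropped. The sketch below uses a made-up DataFrame with missing values in different positions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 and row 2 are partly missing, row 3 is entirely missing
df_nan = pd.DataFrame({
    "Math": [34.0, np.nan, 87.2, np.nan],
    "Physics": ["A+", "B+", None, None],
})

any_nan = df_nan.dropna()                    # drop rows with at least one missing value
all_nan = df_nan.dropna(how="all")           # drop rows only if every value is missing
math_only = df_nan.dropna(subset=["Math"])   # only look at the Math column

print(any_nan)
print(all_nan)
print(math_only)
```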

pd.set_option()

Sets the display options used when printing.

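import warnings

import numpy as np
import pandas as pd

# Suppress all warnings (this also hides deprecation notices, so use with care)
warnings.filterwarnings('ignore')

# Maximum number of columns (None shows all of them) and rows to display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
# Format floats with two decimal places
pd.set_option('display.float_format', lambda x: '%.2f' % x)
# Display width in characters
pd.set_option('display.width', 100)
# Number of decimal places to show; the full option name is used because
# the bare 'precision' pattern can match more than one option in newer pandas
pd.set_option('display.precision', 1)

# Line width for printing NumPy arrays, without scientific notation
np.set_printoptions(linewidth=100, suppress=True)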
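
If the settings should apply only to a few print calls, pd.option_context can set them temporarily. A minimal sketch, with the option values chosen arbitrarily for illustration:

```python
import pandas as pd

before = pd.get_option("display.max_rows")

# Options set here are restored automatically when the with block exits
with pd.option_context("display.max_rows", 5, "display.precision", 2):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")
print(before, inside, after)
```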

df.drop()

Deletes rows or columns from the table.
The default is inplace=False, which leaves the original dataframe unchanged and returns a new dataframe.
To modify the original dataframe directly, set inplace=True.
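
A short sketch of dropping by column name and by row label, on a made-up DataFrame. The columns= and index= keywords are an alternative to passing labels together with axis.

```python
import pandas as pd

# Hypothetical data
toy = pd.DataFrame({
    "ID": [1101, 1102, 1103],
    "Height": [173, 192, 186],
    "Weight": [63, 73, 82],
})

no_weight = toy.drop(columns=["Weight"])   # drop a column by name
same = toy.drop("Weight", axis=1)          # equivalent spelling using axis
no_first = toy.drop(index=[0])             # drop the row labelled 0

print(no_weight)
print(no_first)
print(list(toy.columns))                   # the original is unchanged
```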

pd.factorize()

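pandas.factorize(values, sort=False, na_sentinel=-1, size_hint=None)

Encodes the values as an enumerated type or categorical variable: it returns an integer code for each element together with the array of unique values. (In pandas 2.0 and later the na_sentinel parameter was replaced by use_na_sentinel.)
Tested here on the Kaggle diamonds dataset: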

codes, unique = pd.factorize(diamonds_data['cut'])
print(codes)
print(unique)

[0 1 2 ... 3 1 0]
Index(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype='object')

You can also set sort=True, so that the unique values are sorted first and the codes are assigned afterwards:

codes, unique = pd.factorize(diamonds_data['cut'], sort=True)
print(codes)
print(unique)

[2 3 1 ... 4 3 2]
Index(['Fair', 'Good', 'Ideal', 'Premium', 'Very Good'], dtype='object')
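
Since the diamonds data is not included here, the following sketch reproduces the same behaviour on a made-up Series (the values only imitate the cut column). It also shows that, with default settings, a missing value gets the code -1 and is not listed among the uniques.

```python
import pandas as pd

# Hypothetical stand-in for the 'cut' column, with one missing value
cuts = pd.Series(["Ideal", "Premium", "Good", "Premium", None])

# Codes follow order of first appearance; the missing value is coded -1
codes, uniques = pd.factorize(cuts)

# With sort=True the uniques are sorted before codes are assigned
codes_sorted, uniques_sorted = pd.factorize(cuts, sort=True)

print(codes, list(uniques))
print(codes_sorted, list(uniques_sorted))
```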