pandas NaN checks: Pandas Tips and Tricks (1) - CSDN Blog

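Pandas is a very practical Python library for data analysis and is widely used for working with tabular (matrix-like) data. Below are some operations I have used often while learning and working with pandas. Readers can follow along by typing the code into a Python interpreter line by line.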

First, import the required modules

>>> import pandas as pd, numpy as np
>>> from functools import reduce

Create an empty DataFrame

>>> df = pd.DataFrame(columns=list(range(4)))
>>> df
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []

Check whether a DataFrame is empty

>>> if df.empty:
...     print('DataFrame is empty!')
...
DataFrame is empty!
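
One point that is easy to miss: `df.empty` only reports whether the frame has zero rows or zero columns, so a frame filled entirely with NaN is not considered empty. Here is a minimal sketch; the frames `no_rows` and `all_nan` are made up for illustration.

```python
import numpy as np
import pandas as pd

no_rows = pd.DataFrame(columns=[0, 1])            # columns but no rows
all_nan = pd.DataFrame({"a": [np.nan, np.nan]})   # two rows, both NaN

no_rows_is_empty = no_rows.empty                  # True: there are no rows
all_nan_is_empty = all_nan.empty                  # False: NaN rows still count as rows
after_dropna_is_empty = all_nan.dropna().empty    # True once the NaN rows are dropped
```

If "empty" should mean "contains no usable data", dropping the missing values first, as in the last line, is one way to express that.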

Add a row

>>> df.loc[0] = [1, 3, 5, np.nan]
>>> df
     0    1    2   3
0  1.0  3.0  5.0 NaN

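Add several rows (with `pd.concat`, since `DataFrame.append` was removed in pandas 2.0)

>>> rows = pd.DataFrame([[2, 4, np.nan, np.nan], [3, 5, np.nan, np.nan], [3, np.nan, np.nan, np.nan]])
>>> df = pd.concat([df, rows], ignore_index=True)
>>> df
     0    1    2   3
0  1.0  3.0  5.0 NaN
1  2.0  4.0  NaN NaN
2  3.0  5.0  NaN NaN
3  3.0  NaN  NaN NaN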

Reorder the columns according to a given list of column names

>>> cols = [1, 0, 3, 2]
>>> df_tmp = df[cols]
>>> df_tmp
     1    0   3    2
0  3.0  1.0 NaN  5.0
1  4.0  2.0 NaN  NaN
2  5.0  3.0 NaN  NaN
3  NaN  3.0 NaN  NaN

Drop columns that contain more than 2 NaN values

>>> df
     0    1    2   3
0  1.0  3.0  5.0 NaN
1  2.0  4.0  NaN NaN
2  3.0  5.0  NaN NaN
3  3.0  NaN  NaN NaN
>>> df.loc[:, (df.isnull().sum(axis=0) <= 2)]
     0    1
0  1.0  3.0
1  2.0  4.0
2  3.0  5.0
3  3.0  NaN
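
The same filter can also be written with `dropna` and its `thresh` parameter, which is the minimum number of non-NaN values a column needs in order to be kept. The sketch below rebuilds the four-row frame from this example and assumes "at most 2 NaN" translates to "at least `len(df) - 2` non-NaN values".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, 3, 5, np.nan],
    [2, 4, np.nan, np.nan],
    [3, 5, np.nan, np.nan],
    [3, np.nan, np.nan, np.nan],
])

# Boolean-mask version: keep columns with at most 2 NaN values
by_mask = df.loc[:, df.isnull().sum(axis=0) <= 2]

# dropna version: keep columns with at least len(df) - 2 non-NaN values
by_thresh = df.dropna(axis=1, thresh=len(df) - 2)
```

Both variants keep columns 0 and 1 here. The mask version is easier to adapt to other conditions, while `dropna` states the intent more directly.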

Drop columns that consist entirely of zeros

>>> df[4] = 0
>>> df
     0    1    2   3  4
0  1.0  3.0  5.0 NaN  0
1  2.0  4.0  NaN NaN  0
2  3.0  5.0  NaN NaN  0
3  3.0  NaN  NaN NaN  0
>>> df.loc[:, (df != 0).any(axis=0)]
     0    1    2   3
0  1.0  3.0  5.0 NaN
1  2.0  4.0  NaN NaN
2  3.0  5.0  NaN NaN
3  3.0  NaN  NaN NaN

Select the rows that contain exactly two NaN values:

>>> df[df.isnull().sum(axis=1) == 2]
     0    1   2   3  4
1  2.0  4.0 NaN NaN  0
2  3.0  5.0 NaN NaN  0
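
The NaN filters above rely on `isnull()` rather than a comparison with `np.nan`, and the reason is worth a short sketch: NaN never compares equal to anything, including itself. The Series `s` below is an illustrative example, not data from the walkthrough.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

wrong_count = (s == np.nan).sum()   # 0: NaN == NaN is always False
right_count = s.isna().sum()        # 2: isna() (alias isnull()) finds both missing values
scalar_check = pd.isna(np.nan)      # pd.isna also works on a single value
```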

Select the columns whose number of zeros lies in the interval [1, 3]

>>> df[5] = [1, 0, 0, 1]
>>> df[6] = [1, 0, 1, 1]
>>> df
     0    1    2   3  4  5  6
0  1.0  3.0  5.0 NaN  0  1  1
1  2.0  4.0  NaN NaN  0  0  0
2  3.0  5.0  NaN NaN  0  0  1
3  3.0  NaN  NaN NaN  0  1  1
>>> df.loc[:, ((df == 0).sum() >= 1) & ((df == 0).sum() <= 3)]
   5  6
0  1  1
1  0  0
2  0  1
3  1  1
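
The two-sided condition can be shortened with `Series.between`, which includes both endpoints by default. This sketch uses a reduced frame with only columns 4, 5 and 6 so that it is self-contained.

```python
import pandas as pd

df = pd.DataFrame({4: [0, 0, 0, 0], 5: [1, 0, 0, 1], 6: [1, 0, 1, 1]})

zero_counts = (df == 0).sum()                     # zeros per column
selected = df.loc[:, zero_counts.between(1, 3)]   # keep counts in [1, 3]
```

Computing `zero_counts` once also avoids evaluating `(df == 0).sum()` twice.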

Check whether columns 5 and 6 agree element-wise (the result can be assigned to a new column of df)

>>> df[5] == df[6]
0     True
1     True
2    False
3     True
dtype: bool
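
Columns 5 and 6 contain no missing values, so `==` is sufficient there. When the columns being compared may contain NaN, `==` reports a mismatch wherever both sides are NaN. The following sketch, on two made-up Series, shows two NaN-aware alternatives.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

naive_all_equal = (a == b).all()      # False: the NaN pair compares unequal
nan_aware_equal = a.equals(b)         # True: equals() treats aligned NaNs as equal

# Element-wise version that counts "both NaN" as a match
elementwise = (a == b) | (a.isna() & b.isna())
```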

Add a column, then check whether a given element occurs in it

>>> df[7] = ["A", "B", "C", "D"]
>>> 'A' in df[7].values
True
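
The `.values` in the membership test matters: using `in` directly on a Series checks the index labels, not the data. Here is a short sketch with an illustrative Series `s`.

```python
import pandas as pd

s = pd.Series(["A", "B", "C", "D"])

in_index = "A" in s               # False: "A" is not an index label
label_in_index = 0 in s           # True: 0 is an index label
in_values = "A" in s.values       # True: checks the data itself
via_isin = s.isin(["A"]).any()    # True: also works for several candidates at once
```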

Merge several DataFrames. Below, two new frames df1 and df2 are derived from df, and all three are merged on the column named 7.

>>> df1 = df.loc[:, 3:]
>>> df2 = df.loc[:, 4:]
>>> df1
    3  4  5  6  7
0 NaN  0  1  1  A
1 NaN  0  0  0  B
2 NaN  0  0  1  C
3 NaN  0  1  1  D
>>> df2
   4  5  6  7
0  0  1  1  A
1  0  0  0  B
2  0  0  1  C
3  0  1  1  D
>>> reduce(lambda left, right: pd.merge(left, right, on=7), [df, df1, df2])
     0    1    2  3_x  4_x  5_x  6_x  7  3_y  4_y  5_y  6_y  4  5  6
0  1.0  3.0  5.0  NaN    0    1    1  A  NaN    0    1    1  0  1  1
1  2.0  4.0  NaN  NaN    0    0    0  B  NaN    0    0    0  0  0  0
2  3.0  5.0  NaN  NaN    0    0    1  C  NaN    0    0    1  0  0  1
3  3.0  NaN  NaN  NaN    0    1    1  D  NaN    0    1    1  0  1  1
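
The `_x` and `_y` suffixes in the result appear because the merged frames share column names other than the key. The sketch below uses three small made-up frames (`scores`, `counts`, `flags`) to show the same `reduce` pattern without overlapping columns, and how the `suffixes` parameter controls the names when columns do overlap.

```python
from functools import reduce

import pandas as pd

scores = pd.DataFrame({"key": ["A", "B"], "score": [1, 2]})
counts = pd.DataFrame({"key": ["A", "B"], "count": [3, 4]})
flags = pd.DataFrame({"key": ["A", "B"], "flag": [0, 1]})

# No shared non-key columns, so no suffixes are added
merged = reduce(lambda left, right: pd.merge(left, right, on="key"),
                [scores, counts, flags])

# Shared non-key column "score": choose the suffixes explicitly
overlap = pd.merge(scores, scores, on="key", suffixes=("_left", "_right"))
```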

Apply a function (here max, which returns the largest value) to each row to produce a new column.

>>> max(1, 2)
2
>>> df[8] = df.apply(lambda row: max(row[5], row[6]), axis=1)
>>> df
     0    1    2   3  4  5  6  7  8
0  1.0  3.0  5.0 NaN  0  1  1  A  1
1  2.0  4.0  NaN NaN  0  0  0  B  0
2  3.0  5.0  NaN NaN  0  0  1  C  1
3  3.0  NaN  NaN NaN  0  1  1  D  1
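
For a simple reduction such as max, a vectorized call on the selected columns gives the same result as the row-wise `apply` and is usually faster on large frames. This sketch uses a reduced frame with only columns 5 and 6.

```python
import pandas as pd

df = pd.DataFrame({5: [1, 0, 0, 1], 6: [1, 0, 1, 1]})

via_apply = df.apply(lambda row: max(row[5], row[6]), axis=1)  # row by row
via_max = df[[5, 6]].max(axis=1)                               # vectorized
```

`apply` remains the more general tool when the per-row logic cannot be expressed with built-in reductions.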

Use a function that returns several values to build several new columns

>>> def sum_max(row):
...     n_sum = sum(row[4:7])
...     n_max = max(row[4:7])
...     return n_sum, n_max
...
>>> df["n_sum"], df["n_max"] = zip(*df.apply(sum_max, axis=1))
>>> df
     0    1    2   3  4  5  6  7  8  n_sum  n_max
0  1.0  3.0  5.0 NaN  0  1  1  A  1      2      1
1  2.0  4.0  NaN NaN  0  0  0  B  0      0      0
2  3.0  5.0  NaN NaN  0  0  1  C  1      1      1
3  3.0  NaN  NaN NaN  0  1  1  D  1      2      1
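
An alternative to unpacking with `zip(*...)` is to let `apply` expand the returned tuple into columns via `result_type="expand"`. The sketch below works on a reduced frame holding only columns 4, 5 and 6, so this `sum_max` reduces the whole row instead of slicing it; that simplification is an assumption of the example.

```python
import pandas as pd

df = pd.DataFrame({4: [0, 0, 0, 0], 5: [1, 0, 0, 1], 6: [1, 0, 1, 1]})

def sum_max(row):
    # Return two values per row: the sum and the maximum
    return row.sum(), row.max()

# Each returned tuple becomes one row of a new two-column DataFrame
expanded = df.apply(sum_max, axis=1, result_type="expand")
expanded.columns = ["n_sum", "n_max"]

result = df.join(expanded)   # attach the new columns by index
```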

Select the columns whose names start with a given prefix

>>> df.loc[:, df.columns.astype(str).str.startswith("n_")]
   n_sum  n_max
0      2      1
1      0      0
2      1      1
3      2      1

Alternatively:

>>> df.filter(regex=r'^n_', axis=1)
   n_sum  n_max
0      2      1
1      0      0
2      1      1
3      2      1

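For each row, find the name of the column holding the maximum among columns 4, 5 and 6 (on ties, idxmax returns the first such column)

>>> df.iloc[:, 4:7]
   4  5  6
0  0  1  1
1  0  0  0
2  0  0  1
3  0  1  1
>>> df.iloc[:, 4:7].idxmax(axis=1)
0    5
1    4
2    6
3    5
dtype: int64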

Build a dictionary from columns 7 and 8

>>> mydict = dict(zip(df[7], df[8]))
>>> mydict
{'A': 1, 'B': 0, 'C': 1, 'D': 1}
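
The same mapping can be obtained without `zip` by making the key column the index and calling `to_dict()` on the value column. This sketch uses a reduced frame with only columns 7 and 8.

```python
import pandas as pd

df = pd.DataFrame({7: ["A", "B", "C", "D"], 8: [1, 0, 1, 1]})

via_zip = dict(zip(df[7], df[8]))
via_index = df.set_index(7)[8].to_dict()   # keys from column 7, values from column 8
```

In both versions, if the key column contains duplicates, the last occurrence wins.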

Concatenate columns 7 and 8 as strings to form a new column

>>> df[9] = df[7] + "_" + df[8].astype(str)
>>> df
     0    1    2   3  4  5  6  7  8  n_sum  n_max    9
0  1.0  3.0  5.0 NaN  0  1  1  A  1      2      1  A_1
1  2.0  4.0  NaN NaN  0  0  0  B  0      0      0  B_0
2  3.0  5.0  NaN NaN  0  0  1  C  1      1      1  C_1
3  3.0  NaN  NaN NaN  0  1  1  D  1      2      1  D_1

Replace '_' with '-' in column 9

>>> df[9] = df[9].str.replace('_', '-')
>>> df
     0    1    2   3  4  5  6  7  8  n_sum  n_max    9
0  1.0  3.0  5.0 NaN  0  1  1  A  1      2      1  A-1
1  2.0  4.0  NaN NaN  0  0  0  B  0      0      0  B-0
2  3.0  5.0  NaN NaN  0  0  1  C  1      1      1  C-1
3  3.0  NaN  NaN NaN  0  1  1  D  1      2      1  D-1
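
`'_'` has no special meaning in regular expressions, so the replacement above behaves the same whether the pattern is treated as a regex or as a literal. For characters such as `.` the difference matters, and the default for the `regex` parameter changed to `False` in pandas 2.0, so passing it explicitly keeps the behaviour stable across versions. Here is a small sketch on a made-up Series.

```python
import pandas as pd

s = pd.Series(["A.1", "B.0"])

literal = s.str.replace(".", "-", regex=False)   # replaces only the dot
as_regex = s.str.replace(".", "-", regex=True)   # "." matches every character
```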

Split column 9 into two columns

>>> df[9].str.split("-", expand=True)
   0  1
0  A  1
1  B  0
2  C  1
3  D  1
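
The split result has default column names 0 and 1, and both parts are strings. Below is a minimal sketch of naming the parts and converting the numeric one; the names `letter` and `flag` are invented for this example.

```python
import pandas as pd

s = pd.Series(["A-1", "B-0", "C-1", "D-1"])

parts = s.str.split("-", expand=True)
parts.columns = ["letter", "flag"]           # replace the default names 0 and 1
parts["flag"] = parts["flag"].astype(int)    # split yields strings; convert explicitly
```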

Replace values using a dictionary (here in the n_max column)

>>> di = {1: "AA", 0: "BB"}
>>> df.replace({"n_max": di})
     0    1    2   3  4  5  6  7  8  n_sum n_max    9
0  1.0  3.0  5.0 NaN  0  1  1  A  1      2    AA  A-1
1  2.0  4.0  NaN NaN  0  0  0  B  0      0    BB  B-0
2  3.0  5.0  NaN NaN  0  0  1  C  1      1    AA  C-1
3  3.0  NaN  NaN NaN  0  1  1  D  1      2    AA  D-1
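
`Series.map` accepts the same kind of dictionary, but it treats values missing from the dictionary differently: `replace` leaves them untouched, while `map` turns them into NaN. The Series below, with an extra value 2 that has no entry in `di`, is an illustrative assumption.

```python
import pandas as pd

di = {1: "AA", 0: "BB"}
s = pd.Series([1, 0, 2])

replaced = s.replace(di)   # 2 has no entry and is kept as-is
mapped = s.map(di)         # 2 has no entry and becomes NaN
```

Which behaviour is preferable depends on whether unmapped values should stay visible or be flagged as missing.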

Rename columns using a dictionary

>>> d = {9: "syb"}
>>> df.rename(columns=d)
     0    1    2   3  4  5  6  7  8  n_sum  n_max  syb
0  1.0  3.0  5.0 NaN  0  1  1  A  1      2      1  A-1
1  2.0  4.0  NaN NaN  0  0  0  B  0      0      0  B-0
2  3.0  5.0  NaN NaN  0  0  1  C  1      1      1  C-1
3  3.0  NaN  NaN NaN  0  1  1  D  1      2      1  D-1

Use column 7 as the index (row labels)

>>> df.set_index(7)
     0    1    2   3  4  5  6  8  n_sum  n_max    9
7
A  1.0  3.0  5.0 NaN  0  1  1  1      2      1  A-1
B  2.0  4.0  NaN NaN  0  0  0  0      0      0  B-0
C  3.0  5.0  NaN NaN  0  0  1  1      1      1  C-1
D  3.0  NaN  NaN NaN  0  1  1  1      2      1  D-1
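
Note that `rename` and `set_index`, like `replace`, return a new DataFrame and leave `df` itself unchanged unless the result is assigned back. A short sketch on a reduced two-column frame:

```python
import pandas as pd

df = pd.DataFrame({7: ["A", "B"], 9: ["A-1", "B-0"]})

renamed = df.rename(columns={9: "syb"})   # new frame with the renamed column
indexed = df.set_index(7)                 # new frame with column 7 as the index

original_columns = list(df.columns)       # df still has its original columns
```

To keep a change, write for example `df = df.rename(columns={9: "syb"})`.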

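Replace every value greater than 2 in column 0 with 10 (assigning through `.loc` is more reliable than writing into `df[0].values`, which may be a copy or read-only in newer pandas versions)

>>> df.loc[df[0] > 2, 0] = 10
>>> df
      0    1    2   3  4  5  6  7  8  n_sum  n_max    9
0   1.0  3.0  5.0 NaN  0  1  1  A  1      2      1  A-1
1   2.0  4.0  NaN NaN  0  0  0  B  0      0      0  B-0
2  10.0  5.0  NaN NaN  0  0  1  C  1      1      1  C-1
3  10.0  NaN  NaN NaN  0  1  1  D  1      2      1  D-1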
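
When the original column should be left untouched, `Series.mask` or `np.where` produce a replaced copy instead of modifying the data in place. Here is a sketch on a standalone Series with the same values as column 0.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 3.0])

masked = s.mask(s > 2, 10)                                     # replace where the condition holds
via_where = pd.Series(np.where(s > 2, 10, s), index=s.index)   # NumPy equivalent
```

`Series.where` is the mirror image of `mask`: it keeps values where the condition holds and replaces the rest.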

Select the rows whose value in column 9 contains the character 0

>>> df[df[9].str.contains("0")]
     0    1   2   3  4  5  6  7  8  n_sum  n_max    9
1  2.0  4.0 NaN NaN  0  0  0  B  0      0      0  B-0
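
Column 9 has no missing values, so the filter above works as written. If the string column may contain NaN, the result of `str.contains` can itself contain missing entries (depending on the pandas version and dtype), and such a mask cannot be used for boolean indexing. Passing `na=False` treats missing strings as non-matches. The Series below is an illustrative example.

```python
import numpy as np
import pandas as pd

s = pd.Series(["A-1", "B-0", np.nan])

safe_mask = s.str.contains("0", na=False)   # missing entries count as "no match"
matches = s[safe_mask]
```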

Due to space constraints, that is all for today's pandas tips. I hope they are helpful. Please stay tuned for more, and thank you!