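Common pandas operations and EDA notes (for personal use)

pandas study notes

These notes use the Boston dataset that ships with sklearn, converted to a DataFrame, as the running example (so I don't have to keep switching sample data 🤔). They record common pandas operations: lookup, deletion, column operations, EDA plots, and so on.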

import pandas as pd
import numpy as np
from sklearn import datasets

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs scikit-learn < 1.2.
dataset = datasets.load_boston()
print("Keys in the dataset:")
print("  ".join(dataset.keys()))
# print("Dataset description:")


# Convert to a DataFrame
data = dataset["data"]
target = dataset["target"]
df = pd.DataFrame(data, columns=dataset["feature_names"])
df["target"] = target

Keys in the dataset:
data  target  feature_names  DESCR  filename
df.info()  # Basic information about the table (shape, column names, dtypes, memory usage, etc.)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  target   506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB
df.describe()   # Summary statistics for each column
             CRIM          ZN       INDUS        CHAS         NOX          RM         AGE         DIS         RAD         TAX     PTRATIO           B       LSTAT      target
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
mean     3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063   22.532806
std      8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062    9.197104
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000    5.000000
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000   17.025000
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000   21.200000
75%      3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000   25.000000
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000   50.000000
df.sample(3)  # Random sample of n rows
         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  PTRATIO       B  LSTAT  target
500   0.22438   0.0   9.69   0.0  0.585  6.027   79.7  2.4982   6.0  391.0     19.2  396.90  14.33    16.8
159   1.42502   0.0  19.58   0.0  0.871  6.510  100.0  1.7659   5.0  403.0     14.7  364.31   7.39    23.3
243   0.12757  30.0   4.93   0.0  0.428  6.393    7.8  7.0355   6.0  300.0     16.6  374.71   5.19    23.7
df.tail(3)  # Last n rows
         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  PTRATIO       B  LSTAT  target
503   0.06076   0.0  11.93   0.0  0.573  6.976   91.0  2.1675   1.0  273.0     21.0  396.90   5.64    23.9
504   0.10959   0.0  11.93   0.0  0.573  6.794   89.3  2.3889   1.0  273.0     21.0  393.45   6.48    22.0
505   0.04741   0.0  11.93   0.0  0.573  6.030   80.8  2.5050   1.0  273.0     21.0  396.90   7.88    11.9
df.head(3)  # First n rows
         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  PTRATIO       B  LSTAT  target
0     0.00632  18.0   2.31   0.0  0.538  6.575   65.2  4.0900   1.0  296.0     15.3  396.90   4.98    24.0
1     0.02731   0.0   7.07   0.0  0.469  6.421   78.9  4.9671   2.0  242.0     17.8  396.90   9.14    21.6
2     0.02729   0.0   7.07   0.0  0.469  7.185   61.1  4.9671   2.0  242.0     17.8  392.83   4.03    34.7

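1. Locating rows and columns

1.1 Direct indexing

df[3:4]  # A numeric slice in single brackets returns that row with all of its columns.
# Wrong:
# df[3]  (a bare integer is looked up as a column label and raises KeyError)
         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  PTRATIO       B  LSTAT  target
3     0.03237   0.0   2.18   0.0  0.458  6.998   45.8  6.0622   3.0  222.0     18.7  394.63   2.94    33.4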
df['B']  # A column-name string in single brackets returns every row of that column
0      396.90
1      396.90
2      392.83
3      394.63
4      396.90
        ...  
501    391.99
502    396.90
503    396.90
504    393.45
505    396.90
Name: B, Length: 506, dtype: float64
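df[2:5][["B","ZN"]]  # Chained brackets: the first selects rows, the second selects columns; wrap multiple column names in a list
        B   ZN
2  392.83  0.0
3  394.63  0.0
4  396.90  0.0

As the examples above show, plain bracket indexing has limits: selecting columns 4 through 8 means typing out every column name, which is tedious, and non-contiguous rows cannot be picked by position this way. That is where the more capable loc and iloc come in.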
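The limits of plain bracket indexing are easy to reproduce on a tiny stand-in frame. This is only a sketch: the name demo and its values are made up for illustration and are not the Boston data.

```python
import pandas as pd

# Tiny stand-in frame; the values are invented for illustration only.
demo = pd.DataFrame({"B": [396.9, 396.9, 392.83, 394.63],
                     "ZN": [18.0, 0.0, 0.0, 0.0]})

rows = demo[1:3]   # a slice inside [] selects rows by position
col = demo["B"]    # a string inside [] selects a column

# A bare integer is looked up as a column label, so it raises KeyError
# here because no column is named 1.
try:
    demo[1]
    bare_int_failed = False
except KeyError:
    bare_int_failed = True

print(rows.shape, col.shape, bare_int_failed)
```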

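1.2 loc and iloc: loc selects by label, iloc selects by integer position

Note that both indexers take a single pair of square brackets, not parentheses or two pairs of brackets; a comma separates the row selector from the column selector.

# Multiple columns, selected by name
df.loc[2:5,['CRIM','ZN','INDUS']]
      CRIM   ZN  INDUS
2  0.02729  0.0   7.07
3  0.03237  0.0   2.18
4  0.06905  0.0   2.18
5  0.02985  0.0   2.18
# Multiple columns, selected by position
df.iloc[2:5,:3]
      CRIM   ZN  INDUS
2  0.02729  0.0   7.07
3  0.03237  0.0   2.18
4  0.06905  0.0   2.18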
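The two outputs above differ by one row: loc[2:5] returned labels 2 through 5, while iloc[2:5] stopped at position 4. A minimal sketch of that difference, using a made-up frame named demo rather than the Boston data:

```python
import pandas as pd

# Invented values; the default RangeIndex makes labels and positions coincide.
demo = pd.DataFrame({"CRIM": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                     "ZN": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})

by_label = demo.loc[2:5, ["CRIM"]]   # label slice: the end label 5 is included
by_position = demo.iloc[2:5, :1]     # position slice: stops before position 5

print(len(by_label), len(by_position))
```

With the default integer index the two look interchangeable, so the inclusive end of loc is an easy thing to trip over.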
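# Note: a single row and a single column return a scalar (e.g. df.iloc[1,2]); using a slice returns a Series (e.g. df.iloc[:,2])
# Return the value at a specific row/column position rather than a pandas slice
print('Example 1: df.iloc[3,4] returns the value at that position')
print(df.iloc[3,4])  # the value at that position
print(type(df.iloc[3,4]))


print('Example 2: df.iloc[3:5,4] returns a Series')
print(df.iloc[3:5,4])  # a Series, because the row selector is a slice
print(type(df.iloc[3:5,4]))
# Note: chained plain brackets do not behave like this; df[3][4] raises KeyError because df[3] looks for a column labelled 3

Example 1: df.iloc[3,4] returns the value at that position
0.458
<class 'numpy.float64'>
Example 2: df.iloc[3:5,4] returns a Series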
3    0.458
4    0.458
Name: NOX, dtype: float64
<class 'pandas.core.series.Series'>
# Non-contiguous rows/columns: pass lists of positions inside []
df.iloc[[4,5,6,7,8,10],[1,3,5]]
      ZN  CHAS     RM
4    0.0   0.0  7.147
5    0.0   0.0  6.430
6   12.5   0.0  6.012
7   12.5   0.0  6.172
8   12.5   0.0  5.631
10  12.5   0.0  6.377

1.3 Selecting rows by multiple conditions

# Select rows by multiple conditions; join the conditions with & (and), each wrapped in parentheses
df.loc[(df['target']<10) & (df['target']>5)]
         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  PTRATIO       B  LSTAT  target
384  20.08490   0.0  18.10   0.0  0.700  4.368   91.2  1.4395  24.0  666.0     20.2  285.83  30.63     8.8
385  16.81180   0.0  18.10   0.0  0.700  5.277   98.1  1.4261  24.0  666.0     20.2  396.90  30.81     7.2
387  22.59710   0.0  18.10   0.0  0.700  5.000   89.5  1.5184  24.0  666.0     20.2  396.90  31.99     7.4
392  11.57790   0.0  18.10   0.0  0.700  5.036   97.0  1.7700  24.0  666.0     20.2  396.90  25.68     9.7
397   7.67202   0.0  18.10   0.0  0.693  5.747   98.9  1.6334  24.0  666.0     20.2  393.10  19.92     8.5
399   9.91655   0.0  18.10   0.0  0.693  5.852   77.8  1.5004  24.0  666.0     20.2  338.16  29.97     6.3
400  25.04610   0.0  18.10   0.0  0.693  5.987  100.0  1.5888  24.0  666.0     20.2  396.90  26.77     5.6
401  14.23620   0.0  18.10   0.0  0.693  6.343  100.0  1.5741  24.0  666.0     20.2  396.90  20.32     7.2
403  24.80170   0.0  18.10   0.0  0.693  5.349   96.0  1.7028  24.0  666.0     20.2  396.90  19.77     8.3
404  41.52920   0.0  18.10   0.0  0.693  5.531   85.4  1.6074  24.0  666.0     20.2  329.46  27.38     8.5
414  45.74610   0.0  18.10   0.0  0.693  4.519  100.0  1.6582  24.0  666.0     20.2   88.27  36.98     7.0
415  18.08460   0.0  18.10   0.0  0.679  6.434  100.0  1.8347  24.0  666.0     20.2   27.25  29.05     7.2
416  10.83420   0.0  18.10   0.0  0.679  6.782   90.8  1.8195  24.0  666.0     20.2   21.57  25.79     7.5
418  73.53410   0.0  18.10   0.0  0.679  5.957  100.0  1.8026  24.0  666.0     20.2   16.45  20.62     8.8
419  11.81230   0.0  18.10   0.0  0.718  6.824   76.5  1.7940  24.0  666.0     20.2   48.45  22.74     8.4
425  15.86030   0.0  18.10   0.0  0.679  5.896   95.4  1.9096  24.0  666.0     20.2    7.68  24.39     8.3
429   9.33889   0.0  18.10   0.0  0.679  6.380   95.6  1.9682  24.0  666.0     20.2   60.72  24.08     9.5
436  14.42080   0.0  18.10   0.0  0.740  6.461   93.3  2.0026  24.0  666.0     20.2   27.49  18.05     9.6
437  15.17720   0.0  18.10   0.0  0.740  6.152  100.0  1.9142  24.0  666.0     20.2    9.32  26.45     8.7
438  13.67810   0.0  18.10   0.0  0.740  5.935   87.9  1.8206  24.0  666.0     20.2   68.95  34.02     8.4
489   0.18337   0.0  27.74   0.0  0.609  5.414   98.3  1.7554   4.0  711.0     20.1  344.05  23.97     7.0
490   0.20746   0.0  27.74   0.0  0.609  5.093   98.0  1.8226   4.0  711.0     20.1  318.43  29.68     8.1
# or: join the conditions with | ; every value is either < 10 or > 5, so all 506 rows come back
df.loc[(df['target']<10) | (df['target']>5)]
         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  PTRATIO       B  LSTAT  target
0     0.00632  18.0   2.31   0.0  0.538  6.575   65.2  4.0900   1.0  296.0     15.3  396.90   4.98    24.0
1     0.02731   0.0   7.07   0.0  0.469  6.421   78.9  4.9671   2.0  242.0     17.8  396.90   9.14    21.6
2     0.02729   0.0   7.07   0.0  0.469  7.185   61.1  4.9671   2.0  242.0     17.8  392.83   4.03    34.7
3     0.03237   0.0   2.18   0.0  0.458  6.998   45.8  6.0622   3.0  222.0     18.7  394.63   2.94    33.4
4     0.06905   0.0   2.18   0.0  0.458  7.147   54.2  6.0622   3.0  222.0     18.7  396.90   5.33    36.2
..        ...   ...    ...   ...    ...    ...    ...     ...   ...    ...      ...     ...    ...     ...
501   0.06263   0.0  11.93   0.0  0.573  6.593   69.1  2.4786   1.0  273.0     21.0  391.99   9.67    22.4
502   0.04527   0.0  11.93   0.0  0.573  6.120   76.7  2.2875   1.0  273.0     21.0  396.90   9.08    20.6
503   0.06076   0.0  11.93   0.0  0.573  6.976   91.0  2.1675   1.0  273.0     21.0  396.90   5.64    23.9
504   0.10959   0.0  11.93   0.0  0.573  6.794   89.3  2.3889   1.0  273.0     21.0  393.45   6.48    22.0
505   0.04741   0.0  11.93   0.0  0.573  6.030   80.8  2.5050   1.0  273.0     21.0  396.90   7.88    11.9

506 rows × 14 columns
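The same two filters can be checked on a handful of made-up target values. The sketch below also writes the range filter with DataFrame.query, which is simply another way to express it and not something the original notes use.

```python
import pandas as pd

# Invented target values for illustration only.
demo = pd.DataFrame({"target": [4.0, 6.0, 9.5, 12.0, 50.0]})

# & keeps rows where both conditions hold; each condition needs its own parentheses.
both = demo.loc[(demo["target"] < 10) & (demo["target"] > 5)]

# | keeps rows where either condition holds. Every number is < 10 or > 5,
# so this particular pair keeps every non-missing row.
either = demo.loc[(demo["target"] < 10) | (demo["target"] > 5)]

# The same range filter written as a query string.
via_query = demo.query("target < 10 and target > 5")

print(both["target"].tolist(), len(either), via_query.equals(both))
```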

Missing values: inspecting and filling

# Number of missing (NaN) values in each column
df.isna().sum()
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
target     0
dtype: int64
# Fill missing values with 0
df.fillna(0)
         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  PTRATIO       B  LSTAT  target
0     0.00632  18.0   2.31   0.0  0.538  6.575   65.2  4.0900   1.0  296.0     15.3  396.90   4.98    24.0
1     0.02731   0.0   7.07   0.0  0.469  6.421   78.9  4.9671   2.0  242.0     17.8  396.90   9.14    21.6
2     0.02729   0.0   7.07   0.0  0.469  7.185   61.1  4.9671   2.0  242.0     17.8  392.83   4.03    34.7
3     0.03237   0.0   2.18   0.0  0.458  6.998   45.8  6.0622   3.0  222.0     18.7  394.63   2.94    33.4
4     0.06905   0.0   2.18   0.0  0.458  7.147   54.2  6.0622   3.0  222.0     18.7  396.90   5.33    36.2
..        ...   ...    ...   ...    ...    ...    ...     ...   ...    ...      ...     ...    ...     ...
501   0.06263   0.0  11.93   0.0  0.573  6.593   69.1  2.4786   1.0  273.0     21.0  391.99   9.67    22.4
502   0.04527   0.0  11.93   0.0  0.573  6.120   76.7  2.2875   1.0  273.0     21.0  396.90   9.08    20.6
503   0.06076   0.0  11.93   0.0  0.573  6.976   91.0  2.1675   1.0  273.0     21.0  396.90   5.64    23.9
504   0.10959   0.0  11.93   0.0  0.573  6.794   89.3  2.3889   1.0  273.0     21.0  393.45   6.48    22.0
505   0.04741   0.0  11.93   0.0  0.573  6.030   80.8  2.5050   1.0  273.0     21.0  396.90   7.88    11.9

506 rows × 14 columns

## Or fill the missing values of one specific column only
df.loc[:,"B"].fillna(0)
0      396.90
1      396.90
2      392.83
3      394.63
4      396.90
        ...  
501    391.99
502    396.90
503    396.90
504    393.45
505    396.90
Name: B, Length: 506, dtype: float64
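The Boston frame has no missing values, so the calls above change nothing. To see isna and fillna actually do something, here is a small sketch on a made-up frame with gaps (demo and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Invented values with one gap per column.
demo = pd.DataFrame({"B": [396.9, np.nan, 392.83],
                     "ZN": [np.nan, 0.0, 12.5]})

missing_per_column = demo.isna().sum()   # one missing value in each column

# fillna returns a new object; assign it back to keep the result.
demo["B"] = demo["B"].fillna(0)

# A dict fills different columns with different values in one call.
filled = demo.fillna({"ZN": demo["ZN"].mean()})

print(missing_per_column.tolist(), demo["B"].tolist(), filled["ZN"].tolist())
```

Filling with the column mean instead of 0 is just one common choice; which fill value is appropriate depends on the data.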

Row and column attributes

# Method 1:
print(df.columns.values)  # column labels as an array
#print(df.index.values)  # row labels as an array
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT' 'target']
# Method 2:
list(df)
['CRIM',
 'ZN',
 'INDUS',
 'CHAS',
 'NOX',
 'RM',
 'AGE',
 'DIS',
 'RAD',
 'TAX',
 'PTRATIO',
 'B',
 'LSTAT',
 'target']
df.columns[0]  # Returns the name of the column at that position
'CRIM'
print(df.shape[0])  # number of rows
print(df.shape[1])  # number of columns
506
14

Adding and deleting

Note: drop does not modify the original df. To keep the change, reassign the result or set inplace=True.
Every drop needs an axis: axis=0 (the default) drops rows, axis=1 drops columns.

df["test_col"] = 0  # Assigning to a new name adds a column; it is appended as the last column
df
         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  PTRATIO       B  LSTAT  target  test_col
0     0.00632  18.0   2.31   0.0  0.538  6.575   65.2  4.0900   1.0  296.0     15.3  396.90   4.98    24.0         0
1     0.02731   0.0   7.07   0.0  0.469  6.421   78.9  4.9671   2.0  242.0     17.8  396.90   9.14    21.6         0
2     0.02729   0.0   7.07   0.0  0.469  7.185   61.1  4.9671   2.0  242.0     17.8  392.83   4.03    34.7         0
3     0.03237   0.0   2.18   0.0  0.458  6.998   45.8  6.0622   3.0  222.0     18.7  394.63   2.94    33.4         0
4     0.06905   0.0   2.18   0.0  0.458  7.147   54.2  6.0622   3.0  222.0     18.7  396.90   5.33    36.2         0
..        ...   ...    ...   ...    ...    ...    ...     ...   ...    ...      ...     ...    ...     ...       ...
501   0.06263   0.0  11.93   0.0  0.573  6.593   69.1  2.4786   1.0  273.0     21.0  391.99   9.67    22.4         0
502   0.04527   0.0  11.93   0.0  0.573  6.120   76.7  2.2875   1.0  273.0     21.0  396.90   9.08    20.6         0
503   0.06076   0.0  11.93   0.0  0.573  6.976   91.0  2.1675   1.0  273.0     21.0  396.90   5.64    23.9         0
504   0.10959   0.0  11.93   0.0  0.573  6.794   89.3  2.3889   1.0  273.0     21.0  393.45   6.48    22.0         0
505   0.04741   0.0  11.93   0.0  0.573  6.030   80.8  2.5050   1.0  273.0     21.0  396.90   7.88    11.9         0

506 rows × 15 columns

## Method 1: delete by column name
df.drop(["test_col"],axis = 1,inplace=True)  # axis=1 drops columns; axis=0 drops rows
df
         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  PTRATIO       B  LSTAT  target
0     0.00632  18.0   2.31   0.0  0.538  6.575   65.2  4.0900   1.0  296.0     15.3  396.90   4.98    24.0
1     0.02731   0.0   7.07   0.0  0.469  6.421   78.9  4.9671   2.0  242.0     17.8  396.90   9.14    21.6
2     0.02729   0.0   7.07   0.0  0.469  7.185   61.1  4.9671   2.0  242.0     17.8  392.83   4.03    34.7
3     0.03237   0.0   2.18   0.0  0.458  6.998   45.8  6.0622   3.0  222.0     18.7  394.63   2.94    33.4
4     0.06905   0.0   2.18   0.0  0.458  7.147   54.2  6.0622   3.0  222.0     18.7  396.90   5.33    36.2
..        ...   ...    ...   ...    ...    ...    ...     ...   ...    ...      ...     ...    ...     ...
501   0.06263   0.0  11.93   0.0  0.573  6.593   69.1  2.4786   1.0  273.0     21.0  391.99   9.67    22.4
502   0.04527   0.0  11.93   0.0  0.573  6.120   76.7  2.2875   1.0  273.0     21.0  396.90   9.08    20.6
503   0.06076   0.0  11.93   0.0  0.573  6.976   91.0  2.1675   1.0  273.0     21.0  396.90   5.64    23.9
504   0.10959   0.0  11.93   0.0  0.573  6.794   89.3  2.3889   1.0  273.0     21.0  393.45   6.48    22.0
505   0.04741   0.0  11.93   0.0  0.573  6.030   80.8  2.5050   1.0  273.0     21.0  396.90   7.88    11.9

506 rows × 14 columns
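A small sketch of the drop behaviour on a made-up frame (demo and the column name test_col are assumptions for illustration). It uses the columns= and index= keywords, which say the same thing as axis=1 and axis=0 without having to remember the numbers:

```python
import pandas as pd

# Invented values for illustration only.
demo = pd.DataFrame({"CRIM": [0.1, 0.2], "ZN": [18.0, 0.0], "test_col": [0, 0]})

without_col = demo.drop(columns=["test_col"])   # same as drop(["test_col"], axis=1)
without_row = demo.drop(index=[0])              # same as drop([0], axis=0)

# demo itself is untouched because inplace defaults to False.
print(list(demo.columns), list(without_col.columns), list(without_row.index))
```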

# Add 10 to every value in the target column; element-wise operations like this can be written with apply plus a lambda
df['target'].apply(lambda x:x+10)
0      34.0
1      31.6
2      44.7
3      43.4
4      46.2
       ... 
501    32.4
502    30.6
503    33.9
504    32.0
505    21.9
Name: target, Length: 506, dtype: float64
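For simple arithmetic like this, plain column arithmetic gives the same result as apply and avoids calling a Python function once per element. A sketch on a few made-up target values:

```python
import pandas as pd

# Invented values for illustration only.
target = pd.Series([24.0, 21.6, 34.7], name="target")

via_apply = target.apply(lambda x: x + 10)   # calls the lambda once per element
vectorized = target + 10                     # same values, computed on the whole column at once

# apply remains useful when the logic has no direct vectorized form.
label = target.apply(lambda x: "high" if x > 30 else "low")

print(via_apply.equals(vectorized), label.tolist())
```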

## Method 2: delete by column position
cols_to_drop = [1,3,5,6]  # renamed from list so the built-in list() is not shadowed
df.drop(df.columns[cols_to_drop],axis =1)
         CRIM  INDUS    NOX     DIS   RAD    TAX  PTRATIO       B  LSTAT  target
0     0.00632   2.31  0.538  4.0900   1.0  296.0     15.3  396.90   4.98    24.0
1     0.02731   7.07  0.469  4.9671   2.0  242.0     17.8  396.90   9.14    21.6
2     0.02729   7.07  0.469  4.9671   2.0  242.0     17.8  392.83   4.03    34.7
3     0.03237   2.18  0.458  6.0622   3.0  222.0     18.7  394.63   2.94    33.4
4     0.06905   2.18  0.458  6.0622   3.0  222.0     18.7  396.90   5.33    36.2
..        ...    ...    ...     ...   ...    ...      ...     ...    ...     ...
501   0.06263  11.93  0.573  2.4786   1.0  273.0     21.0  391.99   9.67    22.4
502   0.04527  11.93  0.573  2.2875   1.0  273.0     21.0  396.90   9.08    20.6
503   0.06076  11.93  0.573  2.1675   1.0  273.0     21.0  396.90   5.64    23.9
504   0.10959  11.93  0.573  2.3889   1.0  273.0     21.0  393.45   6.48    22.0
505   0.04741  11.93  0.573  2.5050   1.0  273.0     21.0  396.90   7.88    11.9

506 rows × 10 columns

## Creating fields: derive new fields from existing ones

# df.eval("""new_col1 = col1 + col2
#          new_col2 = col1 + col2
#          new_col3 = col1 + col2""", inplace=False)
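The commented-out eval template can be tried on a made-up two-column frame. This is only a sketch: demo and the new column names RM_plus_AGE and RM_times_AGE are hypothetical.

```python
import pandas as pd

# Invented values for illustration only.
demo = pd.DataFrame({"RM": [6.0, 7.0], "AGE": [60.0, 70.0]})

# One assignment per line. With inplace=False a new DataFrame is returned
# and demo is left unchanged.
derived = demo.eval("RM_plus_AGE = RM + AGE\nRM_times_AGE = RM * AGE", inplace=False)

print(list(derived.columns), list(demo.columns))
```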

Exporting files

#df.to_csv()
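A minimal round trip through CSV, sketched with an in-memory buffer so nothing is written to disk. The frame demo is made up; in real use the buffer would be replaced by a file path of your choosing.

```python
import io
import pandas as pd

# Invented values for illustration only.
demo = pd.DataFrame({"CRIM": [0.00632, 0.02731], "target": [24.0, 21.6]})

buffer = io.StringIO()             # in-memory stand-in for a file path
demo.to_csv(buffer, index=False)   # index=False leaves the row index out of the file
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.shape, list(restored.columns))
```

Without index=False the row index is written as an extra unnamed column, which then shows up as "Unnamed: 0" when the file is read back.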

Other uses

Getting the NumPy array behind the DataFrame

np.array(df)
df.values
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 3.9690e+02, 4.9800e+00,
        2.4000e+01],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 3.9690e+02, 9.1400e+00,
        2.1600e+01],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 3.9283e+02, 4.0300e+00,
        3.4700e+01],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 3.9690e+02, 5.6400e+00,
        2.3900e+01],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 3.9345e+02, 6.4800e+00,
        2.2000e+01],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 3.9690e+02, 7.8800e+00,
        1.1900e+01]])
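Besides np.array(df) and df.values, pandas also provides DataFrame.to_numpy(), which returns the same kind of array. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
demo = pd.DataFrame({"CRIM": [0.00632, 0.02731], "ZN": [18.0, 0.0]})

arr = demo.to_numpy()          # whole frame -> 2-D array of shape (rows, columns)
col = demo["ZN"].to_numpy()    # single column -> 1-D array

print(arr.shape, col.shape, np.array_equal(arr, demo.values))
```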

Casting types

df.iloc[:,1].astype("int")
0      18
1       0
2       0
3       0
4       0
       ..
501     0
502     0
503     0
504     0
505     0
Name: ZN, Length: 506, dtype: int32
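Two things are worth knowing about casting floats to integers: the fractional part is dropped rather than rounded, and a column containing NaN cannot be cast directly. A sketch with made-up values (the dtype is written as "int64" here so the result does not depend on the platform):

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
zn = pd.Series([18.0, 12.5, 2.9, -1.7], name="ZN")
as_int = zn.astype("int64")    # fractional part is dropped: 12.5 -> 12, -1.7 -> -1

with_gap = pd.Series([18.0, np.nan])
try:
    with_gap.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True         # NaN has no integer representation

# Filling the gap first makes the cast possible.
filled_int = with_gap.fillna(0).astype("int64")

print(as_int.tolist(), cast_failed, filled_int.tolist())
```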

Correlation analysis

df.corr()
              CRIM         ZN      INDUS       CHAS        NOX         RM        AGE        DIS        RAD        TAX    PTRATIO          B      LSTAT     target
CRIM      1.000000  -0.200469   0.406583  -0.055892   0.420972  -0.219247   0.352734  -0.379670   0.625505   0.582764   0.289946  -0.385064   0.455621  -0.388305
ZN       -0.200469   1.000000  -0.533828  -0.042697  -0.516604   0.311991  -0.569537   0.664408  -0.311948  -0.314563  -0.391679   0.175520  -0.412995   0.360445
INDUS     0.406583  -0.533828   1.000000   0.062938   0.763651  -0.391676   0.644779  -0.708027   0.595129   0.720760   0.383248  -0.356977   0.603800  -0.483725
CHAS     -0.055892  -0.042697   0.062938   1.000000   0.091203   0.091251   0.086518  -0.099176  -0.007368  -0.035587  -0.121515   0.048788  -0.053929   0.175260
NOX       0.420972  -0.516604   0.763651   0.091203   1.000000  -0.302188   0.731470  -0.769230   0.611441   0.668023   0.188933  -0.380051   0.590879  -0.427321
RM       -0.219247   0.311991  -0.391676   0.091251  -0.302188   1.000000  -0.240265   0.205246  -0.209847  -0.292048  -0.355501   0.128069  -0.613808   0.695360
AGE       0.352734  -0.569537   0.644779   0.086518   0.731470  -0.240265   1.000000  -0.747881   0.456022   0.506456   0.261515  -0.273534   0.602339  -0.376955
DIS      -0.379670   0.664408  -0.708027  -0.099176  -0.769230   0.205246  -0.747881   1.000000  -0.494588  -0.534432  -0.232471   0.291512  -0.496996   0.249929
RAD       0.625505  -0.311948   0.595129  -0.007368   0.611441  -0.209847   0.456022  -0.494588   1.000000   0.910228   0.464741  -0.444413   0.488676  -0.381626
TAX       0.582764  -0.314563   0.720760  -0.035587   0.668023  -0.292048   0.506456  -0.534432   0.910228   1.000000   0.460853  -0.441808   0.543993  -0.468536
PTRATIO   0.289946  -0.391679   0.383248  -0.121515   0.188933  -0.355501   0.261515  -0.232471   0.464741   0.460853   1.000000  -0.177383   0.374044  -0.507787
B        -0.385064   0.175520  -0.356977   0.048788  -0.380051   0.128069  -0.273534   0.291512  -0.444413  -0.441808  -0.177383   1.000000  -0.366087   0.333461
LSTAT     0.455621  -0.412995   0.603800  -0.053929   0.590879  -0.613808   0.602339  -0.496996   0.488676   0.543993   0.374044  -0.366087   1.000000  -0.737663
target   -0.388305   0.360445  -0.483725   0.175260  -0.427321   0.695360  -0.376955   0.249929  -0.381626  -0.468536  -0.507787   0.333461  -0.737663   1.000000
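A full correlation matrix is hard to scan. One common follow-up, sketched here on a made-up four-column frame, is to pull out the target column and rank the features by absolute correlation. The frame demo and its values are assumptions for illustration.

```python
import pandas as pd

# Invented values for illustration only.
demo = pd.DataFrame({
    "RM":     [5.0, 6.0, 7.0, 8.0],
    "LSTAT":  [30.0, 20.0, 12.0, 4.0],
    "CHAS":   [0.0, 1.0, 0.0, 1.0],
    "target": [10.0, 20.0, 30.0, 40.0],
})

# Correlation of every feature with target, excluding target itself.
corr_with_target = demo.corr()["target"].drop("target")

# Order by strength of the relationship while keeping the sign visible.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)

print(ranked)
```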

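EDA plots

import seaborn as sns
from matplotlib import pyplot as plt
plt.figure(dpi = 100)
plt.title('target')
sns.histplot(df['target'], kde=True, stat="density")  # distplot is deprecated since seaborn 0.11; histplot with kde=True is its replacement

[Figure: distribution plot of target]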

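### Unique values of a column, using RAD as an example
df['RAD'].unique()
array([ 1.,  2.,  3.,  5.,  4.,  8.,  6.,  7., 24.])
# Distribution of the values
sns.countplot(x=df['RAD'])  # pass the column as x=; recent seaborn versions treat a positional argument as data

[Figure: count plot of RAD]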
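The numbers behind a count plot can also be read directly with value_counts. A sketch on a few made-up RAD values:

```python
import pandas as pd

# Invented values for illustration only.
rad = pd.Series([1.0, 2.0, 2.0, 24.0, 24.0, 24.0], name="RAD")

levels = rad.unique()                      # distinct values, in order of first appearance
counts = rad.value_counts().sort_index()   # rows per value: the bar heights a count plot draws

print(levels.tolist(), counts.tolist())
```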
