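pandas study notes
Using the Boston dataset that ships with sklearn, converted to a DataFrame, as the running example (so there is no need to keep switching sample data 🤔), these notes record common pandas operations: lookup, deletion, column operations, EDA plotting, and so on.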
import numpy as np
import pandas as pd
from sklearn import datasets
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older release.
dataset = datasets.load_boston()
print("Items contained in the dataset:")
print(" ".join(dataset.keys()))
data = dataset["data"]
target = dataset["target"]
df = pd.DataFrame(data, columns=dataset["feature_names"])
df["target"] = target
Items contained in the dataset:
data target feature_names DESCR filename
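Since `load_boston` is no longer available in scikit-learn 1.2 and later, the construction pattern used above (array plus feature names plus a target column) can be tried on a tiny hand-made stand-in. This is only a sketch: `bunch` and `small_df` are hypothetical names, and the values are the first three rows shown in this post, restricted to four features.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the object returned by the loader:
# a dict with the same three keys that the code above reads.
bunch = {
    "data": np.array([
        [0.00632, 18.0, 2.31, 6.575],
        [0.02731, 0.0, 7.07, 6.421],
        [0.02729, 0.0, 7.07, 7.185],
    ]),
    "target": np.array([24.0, 21.6, 34.7]),
    "feature_names": ["CRIM", "ZN", "INDUS", "RM"],
}

# Same steps as above: features become columns, target is appended last.
small_df = pd.DataFrame(bunch["data"], columns=bunch["feature_names"])
small_df["target"] = bunch["target"]

print(small_df.shape)
print(list(small_df.columns))
```

Any 2-D array with matching column names can be turned into a DataFrame this way, so the rest of the notes can be followed on other data as well.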
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null float64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null float64
9 TAX 506 non-null float64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 506 non-null float64
13 target 506 non-null float64
dtypes: float64(14)
memory usage: 55.5 KB
df.describe()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
df.sample(3)
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 0.22438 | 0.0 | 9.69 | 0.0 | 0.585 | 6.027 | 79.7 | 2.4982 | 6.0 | 391.0 | 19.2 | 396.90 | 14.33 | 16.8 |
| 159 | 1.42502 | 0.0 | 19.58 | 0.0 | 0.871 | 6.510 | 100.0 | 1.7659 | 5.0 | 403.0 | 14.7 | 364.31 | 7.39 | 23.3 |
| 243 | 0.12757 | 30.0 | 4.93 | 0.0 | 0.428 | 6.393 | 7.8 | 7.0355 | 6.0 | 300.0 | 16.6 | 374.71 | 5.19 | 23.7 |
df.tail(3)
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
df.head(3)
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
1. Locating rows and columns
1.1 Direct indexing
df[3:4]
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
df['B']
0 396.90
1 396.90
2 392.83
3 394.63
4 396.90
...
501 391.99
502 396.90
503 396.90
504 393.45
505 396.90
Name: B, Length: 506, dtype: float64
df[2:5][["B","ZN"]]
| | B | ZN |
|---|---|---|
| 2 | 392.83 | 0.0 |
| 3 | 394.63 | 0.0 |
| 4 | 396.90 | 0.0 |
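As the examples above show, direct indexing leaves a lot uncovered: to output columns 4 through 8 you have to type out each column name, which is tedious, and non-contiguous rows cannot be selected at all. This is where the more complete loc and iloc come in.
1.2 Using loc and iloc: loc locates by label (field name), iloc locates by integer position
Note that both take a single pair of square brackets, not parentheses or double brackets, and the row selector and column selector are separated by a comma. A loc slice includes its end label, while an iloc slice excludes its end position.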
df.loc[2:5,['CRIM','ZN','INDUS']]
| | CRIM | ZN | INDUS |
|---|---|---|---|
| 2 | 0.02729 | 0.0 | 7.07 |
| 3 | 0.03237 | 0.0 | 2.18 |
| 4 | 0.06905 | 0.0 | 2.18 |
| 5 | 0.02985 | 0.0 | 2.18 |
df.iloc[2:5,:3]
| | CRIM | ZN | INDUS |
|---|---|---|---|
| 2 | 0.02729 | 0.0 | 7.07 |
| 3 | 0.03237 | 0.0 | 2.18 |
| 4 | 0.06905 | 0.0 | 2.18 |
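The two outputs above differ by one row because label slices and position slices treat the end point differently. A minimal sketch on a made-up frame (`demo` is a hypothetical name, and the index deliberately starts at 10 so that labels and positions cannot be confused):

```python
import pandas as pd

# Small stand-in frame; index labels 10..13 sit at positions 0..3.
demo = pd.DataFrame(
    {
        "CRIM": [0.02729, 0.03237, 0.06905, 0.02985],
        "ZN": [0.0, 0.0, 0.0, 0.0],
        "INDUS": [7.07, 2.18, 2.18, 2.18],
    },
    index=[10, 11, 12, 13],
)

# loc works on labels and includes the end label.
by_label = demo.loc[10:12, ["CRIM", "INDUS"]]

# iloc works on integer positions and excludes the end position.
by_position = demo.iloc[0:2, :2]

print(by_label.index.tolist())     # [10, 11, 12]
print(by_position.index.tolist())  # [10, 11]
```

With the default RangeIndex used in these notes, labels and positions happen to coincide, which is why the only visible difference is the extra last row returned by loc.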
print('Example 1: df.iloc[3,4] returns the value at that position')
print(df.iloc[3,4])
print(type(df.iloc[3,4]))
print('Example 2: df.iloc[3:5,4] returns a Series')
print(df.iloc[3:5,4])
print(type(df.iloc[3:5,4]))
Example 1: df.iloc[3,4] returns the value at that position
0.458
<class 'numpy.float64'>
Example 2: df.iloc[3:5,4] returns a Series
3 0.458
4 0.458
Name: NOX, dtype: float64
<class 'pandas.core.series.Series'>
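The return type depends on how each axis is selected: a single integer collapses that axis, while a slice or a list keeps it. The sketch below uses a made-up two-column frame (`demo` is a hypothetical name) to show the three cases side by side.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "NOX": [0.538, 0.469, 0.469, 0.458, 0.458],
        "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
    }
)

scalar = demo.iloc[3, 0]      # integer row, integer column -> scalar
series = demo.iloc[3:5, 0]    # row slice, integer column  -> Series
frame = demo.iloc[3:5, [0]]   # row slice, column list     -> DataFrame

print(type(scalar).__name__)
print(type(series).__name__)
print(type(frame).__name__)
```

Wrapping a single column position in a list is a convenient way to keep a DataFrame when later code expects two dimensions.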
df.iloc[[4,5,6,7,8,10],[1,3,5]]
| | ZN | CHAS | RM |
|---|---|---|---|
| 4 | 0.0 | 0.0 | 7.147 |
| 5 | 0.0 | 0.0 | 6.430 |
| 6 | 12.5 | 0.0 | 6.012 |
| 7 | 12.5 | 0.0 | 6.172 |
| 8 | 12.5 | 0.0 | 5.631 |
| 10 | 12.5 | 0.0 | 6.377 |
1.3 Selecting rows by multiple conditions
df.loc[(df['target']<10) & (df['target']>5)]
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 384 | 20.08490 | 0.0 | 18.10 | 0.0 | 0.700 | 4.368 | 91.2 | 1.4395 | 24.0 | 666.0 | 20.2 | 285.83 | 30.63 | 8.8 |
| 385 | 16.81180 | 0.0 | 18.10 | 0.0 | 0.700 | 5.277 | 98.1 | 1.4261 | 24.0 | 666.0 | 20.2 | 396.90 | 30.81 | 7.2 |
| 387 | 22.59710 | 0.0 | 18.10 | 0.0 | 0.700 | 5.000 | 89.5 | 1.5184 | 24.0 | 666.0 | 20.2 | 396.90 | 31.99 | 7.4 |
| 392 | 11.57790 | 0.0 | 18.10 | 0.0 | 0.700 | 5.036 | 97.0 | 1.7700 | 24.0 | 666.0 | 20.2 | 396.90 | 25.68 | 9.7 |
| 397 | 7.67202 | 0.0 | 18.10 | 0.0 | 0.693 | 5.747 | 98.9 | 1.6334 | 24.0 | 666.0 | 20.2 | 393.10 | 19.92 | 8.5 |
| 399 | 9.91655 | 0.0 | 18.10 | 0.0 | 0.693 | 5.852 | 77.8 | 1.5004 | 24.0 | 666.0 | 20.2 | 338.16 | 29.97 | 6.3 |
| 400 | 25.04610 | 0.0 | 18.10 | 0.0 | 0.693 | 5.987 | 100.0 | 1.5888 | 24.0 | 666.0 | 20.2 | 396.90 | 26.77 | 5.6 |
| 401 | 14.23620 | 0.0 | 18.10 | 0.0 | 0.693 | 6.343 | 100.0 | 1.5741 | 24.0 | 666.0 | 20.2 | 396.90 | 20.32 | 7.2 |
| 403 | 24.80170 | 0.0 | 18.10 | 0.0 | 0.693 | 5.349 | 96.0 | 1.7028 | 24.0 | 666.0 | 20.2 | 396.90 | 19.77 | 8.3 |
| 404 | 41.52920 | 0.0 | 18.10 | 0.0 | 0.693 | 5.531 | 85.4 | 1.6074 | 24.0 | 666.0 | 20.2 | 329.46 | 27.38 | 8.5 |
| 414 | 45.74610 | 0.0 | 18.10 | 0.0 | 0.693 | 4.519 | 100.0 | 1.6582 | 24.0 | 666.0 | 20.2 | 88.27 | 36.98 | 7.0 |
| 415 | 18.08460 | 0.0 | 18.10 | 0.0 | 0.679 | 6.434 | 100.0 | 1.8347 | 24.0 | 666.0 | 20.2 | 27.25 | 29.05 | 7.2 |
| 416 | 10.83420 | 0.0 | 18.10 | 0.0 | 0.679 | 6.782 | 90.8 | 1.8195 | 24.0 | 666.0 | 20.2 | 21.57 | 25.79 | 7.5 |
| 418 | 73.53410 | 0.0 | 18.10 | 0.0 | 0.679 | 5.957 | 100.0 | 1.8026 | 24.0 | 666.0 | 20.2 | 16.45 | 20.62 | 8.8 |
| 419 | 11.81230 | 0.0 | 18.10 | 0.0 | 0.718 | 6.824 | 76.5 | 1.7940 | 24.0 | 666.0 | 20.2 | 48.45 | 22.74 | 8.4 |
| 425 | 15.86030 | 0.0 | 18.10 | 0.0 | 0.679 | 5.896 | 95.4 | 1.9096 | 24.0 | 666.0 | 20.2 | 7.68 | 24.39 | 8.3 |
| 429 | 9.33889 | 0.0 | 18.10 | 0.0 | 0.679 | 6.380 | 95.6 | 1.9682 | 24.0 | 666.0 | 20.2 | 60.72 | 24.08 | 9.5 |
| 436 | 14.42080 | 0.0 | 18.10 | 0.0 | 0.740 | 6.461 | 93.3 | 2.0026 | 24.0 | 666.0 | 20.2 | 27.49 | 18.05 | 9.6 |
| 437 | 15.17720 | 0.0 | 18.10 | 0.0 | 0.740 | 6.152 | 100.0 | 1.9142 | 24.0 | 666.0 | 20.2 | 9.32 | 26.45 | 8.7 |
| 438 | 13.67810 | 0.0 | 18.10 | 0.0 | 0.740 | 5.935 | 87.9 | 1.8206 | 24.0 | 666.0 | 20.2 | 68.95 | 34.02 | 8.4 |
| 489 | 0.18337 | 0.0 | 27.74 | 0.0 | 0.609 | 5.414 | 98.3 | 1.7554 | 4.0 | 711.0 | 20.1 | 344.05 | 23.97 | 7.0 |
| 490 | 0.20746 | 0.0 | 27.74 | 0.0 | 0.609 | 5.093 | 98.0 | 1.8226 | 4.0 | 711.0 | 20.1 | 318.43 | 29.68 | 8.1 |
df.loc[(df['target']<10) | (df['target']>5)]
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |

506 rows × 14 columns
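The second filter returns all 506 rows because every number is either below 10 or above 5, so the OR condition is always true. A small sketch with made-up target values (`demo` and the mask names are hypothetical) makes the difference between `&`, `|` and `~` easier to see; each condition needs its own parentheses when written inline.

```python
import pandas as pd

# Hypothetical target values chosen to fall inside and outside the (5, 10) range.
demo = pd.DataFrame({"target": [24.0, 8.8, 7.2, 5.0, 50.0, 9.7]})

lower_than_10 = demo["target"] < 10
greater_than_5 = demo["target"] > 5

both = demo.loc[lower_than_10 & greater_than_5]        # AND: strictly between 5 and 10
either = demo.loc[lower_than_10 | greater_than_5]      # OR: true for every number
outside = demo.loc[~(lower_than_10 & greater_than_5)]  # NOT: the complement of `both`
via_query = demo.query("5 < target < 10")              # same rows as `both`

print(both.index.tolist())      # [1, 2, 5]
print(len(either), len(demo))   # 6 6
print(outside.index.tolist())   # [0, 3, 4]
```

Naming the masks first keeps long filters readable, and `query` is an alternative when the condition is easier to read as a string.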
Handling missing values: inspecting and filling
df.isna().sum()
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
target 0
dtype: int64
df.fillna(0)
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |

506 rows × 14 columns
df.loc[:,"B"].fillna(0)
0 396.90
1 396.90
2 392.83
3 394.63
4 396.90
...
501 391.99
502 396.90
503 396.90
504 393.45
505 396.90
Name: B, Length: 506, dtype: float64
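The Boston data has no missing values, so the calls above leave it unchanged. To see what they actually do, here is a sketch on a made-up frame with two gaps (`demo` and the result names are hypothetical). Note that fillna returns a new object, so the result has to be assigned to take effect.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column.
demo = pd.DataFrame({"B": [396.90, np.nan, 392.83], "ZN": [18.0, 0.0, np.nan]})

missing_per_column = demo.isna().sum()   # count of NaN per column
filled_zero = demo.fillna(0)             # every NaN replaced by 0
filled_mean = demo.fillna(demo.mean())   # each NaN replaced by its column mean

# Fill a single column and write the result back into the frame.
demo["B"] = demo["B"].fillna(0)

print(missing_per_column.to_dict())
print(filled_mean)
print(demo)
```

Filling with the column mean is only one possible strategy; which fill value is appropriate depends on the data.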
Getting row and column attributes
print(df.columns.values)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT' 'target']
list(df)
['CRIM',
'ZN',
'INDUS',
'CHAS',
'NOX',
'RM',
'AGE',
'DIS',
'RAD',
'TAX',
'PTRATIO',
'B',
'LSTAT',
'target']
df.columns[0]
'CRIM'
print(df.shape[0])
print(df.shape[1])
506
14
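Adding and deleting
Note: drop does not modify the original df in place. To keep the result, either reassign it or set inplace=True.
When dropping, use axis to choose what is removed: axis=0 (the default) drops rows, axis=1 drops columns.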
df["测试列"] = 0
df
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | 测试列 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 | 0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 | 0 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 | 0 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 | 0 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 | 0 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 | 0 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 | 0 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 | 0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 | 0 |

506 rows × 15 columns
df.drop(["测试列"],axis = 1,inplace=True)
df
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |

506 rows × 14 columns
df['target'].apply(lambda x:x+10)
0 34.0
1 31.6
2 44.7
3 43.4
4 46.2
...
501 32.4
502 30.6
503 33.9
504 32.0
505 21.9
Name: target, Length: 506, dtype: float64
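For simple arithmetic like this, the same result can usually be obtained without apply by operating on the whole column at once. A small sketch, assuming a made-up three-value Series (`target`, `via_apply` and `vectorized` are hypothetical names):

```python
import pandas as pd

target = pd.Series([24.0, 21.6, 34.7], name="target")

via_apply = target.apply(lambda x: x + 10)  # calls the lambda once per element
vectorized = target + 10                    # one operation on the whole column

# Neither form changes the original Series; assign the result to keep it.
frame = target.to_frame()
frame["target_plus_10"] = vectorized

print(vectorized.tolist())
print(frame.columns.tolist())
```

apply remains useful when the per-element logic cannot be expressed with column-wise operators.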
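cols_to_drop = [1,3,5,6]  # column positions; naming this `list` would shadow the built-in used in list(df) above
df.drop(df.columns[cols_to_drop],axis =1)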
| | CRIM | INDUS | NOX | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 2.31 | 0.538 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 7.07 | 0.469 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 7.07 | 0.469 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 2.18 | 0.458 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 2.18 | 0.458 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 11.93 | 0.573 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 11.93 | 0.573 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 11.93 | 0.573 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 11.93 | 0.573 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 505 | 0.04741 | 11.93 | 0.573 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |

506 rows × 10 columns
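The drop call above returns a new frame and leaves df itself with all 14 columns. The sketch below, on a made-up four-column frame (`demo` and the result names are hypothetical), shows the `columns=` and `index=` keywords as an alternative to passing axis, and confirms that the original is untouched.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "CRIM": [0.00632, 0.02731, 0.02729],
        "ZN": [18.0, 0.0, 0.0],
        "INDUS": [2.31, 7.07, 7.07],
        "CHAS": [0.0, 0.0, 0.0],
    }
)

by_name = demo.drop(columns=["ZN", "CHAS"])               # drop columns by label
positions = [1, 3]                                        # the same columns, by position
by_position = demo.drop(columns=demo.columns[positions])
without_first_row = demo.drop(index=[0])                  # drop a row by index label

print(by_name.columns.tolist())            # ['CRIM', 'INDUS']
print(demo.shape)                          # (3, 4): the original is unchanged
print(without_first_row.index.tolist())    # [1, 2]
```

Reassigning (`demo = demo.drop(...)`) is the usual way to keep the result when `inplace=True` is not used.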
File export and processing
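The notes leave this heading empty. As one possible illustration, and purely as an assumption about what was intended here, a CSV round trip with `to_csv` and `read_csv` might look like the sketch below. It writes to an in-memory buffer instead of a file so that it runs anywhere; `demo`, `buffer` and `restored` are hypothetical names.

```python
import io

import pandas as pd

demo = pd.DataFrame({"CRIM": [0.00632, 0.02731], "target": [24.0, 21.6]})

# Write without the index column; a file path such as "boston.csv"
# (hypothetical) could be passed instead of the buffer.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Rewind and read the same text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
print(restored.shape)
```

Passing `index=False` avoids an extra unnamed column appearing when the file is read back.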
Other uses
Getting the NumPy array behind a DataFrame
np.array(df)
df.values
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 3.9690e+02, 4.9800e+00,
2.4000e+01],
[2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 3.9690e+02, 9.1400e+00,
2.1600e+01],
[2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 3.9283e+02, 4.0300e+00,
3.4700e+01],
...,
[6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 3.9690e+02, 5.6400e+00,
2.3900e+01],
[1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 3.9345e+02, 6.4800e+00,
2.2000e+01],
[4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 3.9690e+02, 7.8800e+00,
1.1900e+01]])
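Besides `np.array(df)` and `df.values`, pandas also provides `to_numpy()`, which the pandas documentation recommends over `.values`. A minimal sketch on a made-up frame (`demo`, `arr` and `one_col` are hypothetical names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"CRIM": [0.00632, 0.02731], "ZN": [18.0, 0.0]})

arr = demo.to_numpy()            # whole frame -> 2-D array
one_col = demo["ZN"].to_numpy()  # single column -> 1-D array

print(arr.shape)      # (2, 2)
print(one_col.shape)  # (2,)
```

For a frame whose columns all share one dtype, as here, the three approaches give the same array.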
Casting to another type
df.iloc[:,1].astype("int")
0 18
1 0
2 0
3 0
4 0
..
501 0
502 0
503 0
504 0
505 0
Name: ZN, Length: 506, dtype: int32
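Two details of astype are easy to miss: casting floats to integers drops the fractional part, and, like most pandas methods, it returns a new object instead of changing the frame. The sketch below uses made-up values (`demo`, `truncated` and `converted` are hypothetical names) and spells out `int64` so the result does not depend on the platform's default integer width.

```python
import pandas as pd

demo = pd.DataFrame({"ZN": [18.0, 12.5, 0.0], "RAD": [1.0, 24.0, 5.0]})

# Casting float to int truncates toward zero: 12.5 becomes 12.
truncated = demo["ZN"].astype("int64")

# A dict casts only the listed columns; the others keep their dtype.
converted = demo.astype({"RAD": "int64"})

print(truncated.tolist())          # [18, 12, 0]
print(converted.dtypes.to_dict())
print(demo.dtypes.to_dict())       # the original frame is still all float64
```

If rounding is wanted instead of truncation, the column can be rounded before the cast.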
Correlation analysis
df.corr()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRIM | 1.000000 | -0.200469 | 0.406583 | -0.055892 | 0.420972 | -0.219247 | 0.352734 | -0.379670 | 0.625505 | 0.582764 | 0.289946 | -0.385064 | 0.455621 | -0.388305 |
| ZN | -0.200469 | 1.000000 | -0.533828 | -0.042697 | -0.516604 | 0.311991 | -0.569537 | 0.664408 | -0.311948 | -0.314563 | -0.391679 | 0.175520 | -0.412995 | 0.360445 |
| INDUS | 0.406583 | -0.533828 | 1.000000 | 0.062938 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.595129 | 0.720760 | 0.383248 | -0.356977 | 0.603800 | -0.483725 |
| CHAS | -0.055892 | -0.042697 | 0.062938 | 1.000000 | 0.091203 | 0.091251 | 0.086518 | -0.099176 | -0.007368 | -0.035587 | -0.121515 | 0.048788 | -0.053929 | 0.175260 |
| NOX | 0.420972 | -0.516604 | 0.763651 | 0.091203 | 1.000000 | -0.302188 | 0.731470 | -0.769230 | 0.611441 | 0.668023 | 0.188933 | -0.380051 | 0.590879 | -0.427321 |
| RM | -0.219247 | 0.311991 | -0.391676 | 0.091251 | -0.302188 | 1.000000 | -0.240265 | 0.205246 | -0.209847 | -0.292048 | -0.355501 | 0.128069 | -0.613808 | 0.695360 |
| AGE | 0.352734 | -0.569537 | 0.644779 | 0.086518 | 0.731470 | -0.240265 | 1.000000 | -0.747881 | 0.456022 | 0.506456 | 0.261515 | -0.273534 | 0.602339 | -0.376955 |
| DIS | -0.379670 | 0.664408 | -0.708027 | -0.099176 | -0.769230 | 0.205246 | -0.747881 | 1.000000 | -0.494588 | -0.534432 | -0.232471 | 0.291512 | -0.496996 | 0.249929 |
| RAD | 0.625505 | -0.311948 | 0.595129 | -0.007368 | 0.611441 | -0.209847 | 0.456022 | -0.494588 | 1.000000 | 0.910228 | 0.464741 | -0.444413 | 0.488676 | -0.381626 |
| TAX | 0.582764 | -0.314563 | 0.720760 | -0.035587 | 0.668023 | -0.292048 | 0.506456 | -0.534432 | 0.910228 | 1.000000 | 0.460853 | -0.441808 | 0.543993 | -0.468536 |
| PTRATIO | 0.289946 | -0.391679 | 0.383248 | -0.121515 | 0.188933 | -0.355501 | 0.261515 | -0.232471 | 0.464741 | 0.460853 | 1.000000 | -0.177383 | 0.374044 | -0.507787 |
| B | -0.385064 | 0.175520 | -0.356977 | 0.048788 | -0.380051 | 0.128069 | -0.273534 | 0.291512 | -0.444413 | -0.441808 | -0.177383 | 1.000000 | -0.366087 | 0.333461 |
| LSTAT | 0.455621 | -0.412995 | 0.603800 | -0.053929 | 0.590879 | -0.613808 | 0.602339 | -0.496996 | 0.488676 | 0.543993 | 0.374044 | -0.366087 | 1.000000 | -0.737663 |
| target | -0.388305 | 0.360445 | -0.483725 | 0.175260 | -0.427321 | 0.695360 | -0.376955 | 0.249929 | -0.381626 | -0.468536 | -0.507787 | 0.333461 | -0.737663 | 1.000000 |
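A full correlation matrix is hard to scan, so a common follow-up is to pull out only the column for the target and rank the features by absolute correlation. The sketch below does this on the first five rows of the data shown in this post, restricted to two features, so the numbers it produces are not the ones in the table above; `demo` and the result names are hypothetical.

```python
import pandas as pd

# First five rows of RM, LSTAT and target as shown earlier in these notes.
demo = pd.DataFrame(
    {
        "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
        "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33],
        "target": [24.0, 21.6, 34.7, 33.4, 36.2],
    }
)

corr = demo.corr()  # Pearson correlation by default

# Correlation of each feature with the target, excluding the target itself.
with_target = corr["target"].drop("target")

# Rank features by strength of the relationship, ignoring the sign.
ranked = with_target.abs().sort_values(ascending=False)

print(with_target)
print(ranked.index.tolist())
```

On the full table above, the same ranking would put LSTAT (-0.737663) and RM (0.695360) at the top.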
EDA plotting
import seaborn as sns
from matplotlib import pyplot as plt
plt.figure(dpi = 100)
plt.title('target')
sns.histplot(df['target'], kde=True, stat='density')  # replaces sns.distplot, which is deprecated since seaborn 0.11
![Distribution plot of target](https://img-blog.csdnimg.cn/5163501dde934e5bbe2fd7779a3f1ae5.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5rWq5a2Q6IaP5Li3,size_18,color_FFFFFF,t_70,g_se,x_16)
df['RAD'].unique()
array([ 1., 2., 3., 5., 4., 8., 6., 7., 24.])
sns.countplot(x=df['RAD'])  # pass the column as x=; newer seaborn versions no longer accept it positionally
![Count plot of RAD](https://img-blog.csdnimg.cn/e1f48e37fe6047dd8ca4f0f4754fefe2.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5rWq5a2Q6IaP5Li3,size_13,color_FFFFFF,t_70,g_se,x_16)
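The numbers behind a count plot can also be read directly with `value_counts`, which is handy when exact counts matter more than the picture. A small sketch on made-up RAD values (`rad` and `counts` are hypothetical names, and the counts are not those of the real dataset):

```python
import pandas as pd

# Made-up sample of RAD values; the real column has 506 entries.
rad = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 24.0, 24.0, 24.0], name="RAD")

# Count occurrences of each distinct value, ordered by the value itself.
counts = rad.value_counts().sort_index()

print(counts.to_dict())
print(sorted(rad.unique().tolist()))
```

Without `sort_index()`, value_counts orders the result by frequency, most common first.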