Using the Python pandas and sklearn Libraries

我有两颗糖

Article link: https://blog.csdn.net/qq_41140138/article/details/118549628

1. Creating a DataFrame

pandas defines the DataFrame object for tabular data. A DataFrame is initialized with:

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

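Here, data holds the contents of the table (for example a NumPy array, a dict, or a list), index gives the row labels, and columns gives the column names. Both index and columns accept any array-like value, such as a list.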

For example, to create a table of student scores:

import pandas as pd
import numpy as np

# Random integer scores in [60, 100) for 8 students and 3 subjects
scores = np.random.randint(60, 100, (8, 3))
index = np.arange(8)  # row labels 0..7
columns = ['Chinese', 'Math', 'English']
df = pd.DataFrame(data=scores, index=index, columns=columns)
print(df)  # print the table

The output looks like this (the scores are random, so your values will differ):

   Chinese  Math  English
0       73    93       84
1       69    92       78
2       78    71       72
3       98    92       97
4       81    67       82
5       88    80       65
6       97    88       73
7       88    96       65
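
Since data also accepts a dict that maps column names to values, a similar table can be built column by column. The following is a minimal sketch of that variant and is not part of the original example: the fixed seed, the use of `np.random.default_rng`, and the name `df_from_dict` are assumptions introduced here so that repeated runs give the same scores.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed, so repeated runs produce the same scores
rng = np.random.default_rng(0)

# One array of 8 scores in [60, 100) per subject
data = {
    'Chinese': rng.integers(60, 100, size=8),
    'Math': rng.integers(60, 100, size=8),
    'English': rng.integers(60, 100, size=8),
}

# No index is passed, so pandas uses the default row labels 0..7
df_from_dict = pd.DataFrame(data)

print(df_from_dict.shape)
print(list(df_from_dict.columns))
```

With a dict, the keys become the column names, so the columns argument can be omitted.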

2. Accessing a DataFrame

You can inspect the column names, view part of the table, sort the rows, select rows by condition, and so on:

# Show the column names
print(df.columns)

Index(['Chinese', 'Math', 'English'], dtype='object')


# Show the first three rows
print(df.head(3))

   Chinese  Math  English
0       73    93       84
1       69    92       78
2       78    71       72


# Sort by a given column (ascending by default; returns a new DataFrame)
sorted_df = df.sort_values(by='English')
print(sorted_df)

   Chinese  Math  English
5       88    80       65
7       88    96       65
2       78    71       72
6       97    88       73
1       69    92       78
4       81    67       82
0       73    93       84
3       98    92       97
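
Sorting in descending order, or by more than one column, uses the same `sort_values` method. The sketch below is an illustration rather than part of the original walkthrough: it rebuilds the table from the sample output shown earlier so the result is deterministic, and the names `desc` and `by_two` are chosen here for the example.

```python
import pandas as pd

# Scores copied from the sample output above, so the result is deterministic
df = pd.DataFrame(
    [[73, 93, 84], [69, 92, 78], [78, 71, 72], [98, 92, 97],
     [81, 67, 82], [88, 80, 65], [97, 88, 73], [88, 96, 65]],
    columns=['Chinese', 'Math', 'English'],
)

# Highest English score first
desc = df.sort_values(by='English', ascending=False)
print(desc.head(3))

# English ascending; ties in English are broken by Math, descending
by_two = df.sort_values(by=['English', 'Math'], ascending=[True, False])
print(by_two)
```

In both calls the original `df` keeps its row order, because `sort_values` returns a sorted copy unless `inplace=True` is passed.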


# Slice rows, then select one column
print(df[0:3]['Math'])

0    93
1    92
2    71
Name: Math, dtype: int32


# Select part of the table by position
print(df.iloc[0:3, [0, 1]])

   Chinese  Math
0       73    93
1       69    92
2       78    71
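
`iloc` selects by integer position; its counterpart `loc` selects by label. The sketch below contrasts the two on the sample scores shown earlier (the data is copied from that output, and the variable names are chosen here for illustration). One difference that is easy to miss: a `loc` slice includes its end label, while an `iloc` slice excludes its end position.

```python
import pandas as pd

# Scores copied from the sample output above
df = pd.DataFrame(
    [[73, 93, 84], [69, 92, 78], [78, 71, 72], [98, 92, 97],
     [81, 67, 82], [88, 80, 65], [97, 88, 73], [88, 96, 65]],
    columns=['Chinese', 'Math', 'English'],
)

# Position-based: rows at positions 0, 1, 2 (end position excluded)
by_pos = df.iloc[0:3, [0, 1]]

# Label-based: rows labelled 0 through 3 (end label included)
by_label = df.loc[0:3, ['Chinese', 'Math']]

# Rows and a column in one step, instead of chaining df[0:3]['Math']
math_first_three = df.loc[0:2, 'Math']

print(len(by_pos), len(by_label))
print(math_first_three.tolist())
```

Here the row labels happen to equal the positions, which is why the two forms look so similar; after sorting or filtering, labels and positions no longer line up and the distinction matters.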


# Select rows by condition
print(df[df['Math'] > 80])

   Chinese  Math  English
0       73    93       84
1       69    92       78
3       98    92       97
6       97    88       73
7       88    96       65
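
Several conditions can be combined with `&` (and) and `|` (or). The sketch below is an added illustration using the sample scores from the output above; the names `both` and `either` are chosen here. Each condition needs its own parentheses, because `&` and `|` bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Scores copied from the sample output above
df = pd.DataFrame(
    [[73, 93, 84], [69, 92, 78], [78, 71, 72], [98, 92, 97],
     [81, 67, 82], [88, 80, 65], [97, 88, 73], [88, 96, 65]],
    columns=['Chinese', 'Math', 'English'],
)

# Rows where Math AND English are both above 80
both = df[(df['Math'] > 80) & (df['English'] > 80)]
print(both)

# Rows where Math OR English is above 90
either = df[(df['Math'] > 90) | (df['English'] > 90)]
print(either)
```

The expression inside the brackets is a boolean Series with one entry per row; rows where it is True are kept.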

3. sklearn.datasets

scikit-learn ships with a number of built-in datasets. For example, let's look at the iris dataset:

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

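iris = load_iris()
# Build a DataFrame from the feature matrix
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target  # append the class labels as the last column
# Rename the columns, dropping the "(cm)" unit suffix
df.columns = [
    'sepal length',
    'sepal width',
    'petal length',
    'petal width',
    'label'
]
print(df)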

Here, target holds the class label of each sample, which we append as the last column. The output is:

     sepal length  sepal width  petal length  petal width  label
0             5.1          3.5           1.4          0.2      0
1             4.9          3.0           1.4          0.2      0
2             4.7          3.2           1.3          0.2      0
3             4.6          3.1           1.5          0.2      0
4             5.0          3.6           1.4          0.2      0
..            ...          ...           ...          ...    ...
145           6.7          3.0           5.2          2.3      2
146           6.3          2.5           5.0          1.9      2
147           6.5          3.0           5.2          2.0      2
148           6.2          3.4           5.4          2.3      2
149           5.9          3.0           5.1          1.8      2

[150 rows x 5 columns]
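
The integer labels are easier to read once they are mapped to species names, which `load_iris` exposes as `iris.target_names`. The following sketch is an addition to the original example: the `species` column and the variables `counts` and `means` are names chosen here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target

# Map each integer label (0, 1, 2) to its species name
df['species'] = iris.target_names[iris.target]

# Number of samples per species
counts = df['species'].value_counts()
print(counts)

# Mean of each feature per species
means = df.groupby('species')[iris.feature_names].mean()
print(means)
```

In scikit-learn 0.23 and later, `load_iris(as_frame=True)` can also return the data as a pandas DataFrame directly, which avoids building it by hand.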