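I. Machine Learning Basics
Key points:
- Machine learning is divided into supervised learning and unsupervised learning.
  Supervised learning: has a dependent variable and feature vectors; goal: prediction.
  Unsupervised learning: has no dependent variable, only feature vectors; goal: find structure in the data.
- Supervised learning is divided into regression and classification.
  Regression: the dependent variable is continuous.
  Classification: the dependent variable is discrete.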
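To make the continuous versus discrete distinction concrete, here is a minimal sketch. It assumes synthetic data from scikit-learn's `make_regression` and `make_classification`, which are not used elsewhere in these notes; the sample sizes and seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression, make_classification

# Regression data (assumed synthetic example): the target is a real-valued quantity.
X_reg, y_reg = make_regression(n_samples=100, n_features=3, noise=1.0, random_state=0)

# Classification data (assumed synthetic example): the target is a small set of labels.
X_clf, y_clf = make_classification(n_samples=100, n_features=4, random_state=0)

n_unique_reg = np.unique(y_reg).size
n_unique_clf = np.unique(y_clf).size

print("regression target dtype:", y_reg.dtype, "| distinct values:", n_unique_reg)
print("classification target dtype:", y_clf.dtype, "| distinct values:", n_unique_clf)
```

The regression target is a float array with nearly as many distinct values as samples, while the classification target only takes a handful of integer labels.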
Commonly used packages
# Import the scientific computing packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# The magic command above renders figures directly in the notebook / Python console output
plt.style.use("ggplot")
import seaborn as sns
1.1 Regression
from sklearn import datasets
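# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call only works with an older scikit-learn release.
boston = datasets.load_boston()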
x = boston.data
y = boston.target
features = boston.feature_names
boston_data = pd.DataFrame(x,columns=features)
boston_data.head(2)
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
boston_data["PRICE"] = y
boston_data.head(2)
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
# Pass x and y by keyword: recent seaborn versions no longer accept them positionally
sns.scatterplot(x=boston_data['NOX'], y=boston_data['PRICE'], color="r", alpha=0.6)
plt.title("PRICE~NOX")
plt.show()
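The scatter plot above only visualizes a feature against a continuous target. As a rough sketch of the prediction step that regression aims at, the following fits a one-feature linear model. It assumes the bundled `load_diabetes` dataset and its `bmi` column as stand-ins (since `load_boston` is unavailable in recent scikit-learn), and the split ratio and seed are arbitrary choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumed stand-in dataset: diabetes progression (continuous target).
diabetes = load_diabetes()
bmi_col = list(diabetes.feature_names).index("bmi")
X = diabetes.data[:, [bmi_col]]   # keep 2-D shape: (n_samples, 1)
y = diabetes.target

# Hold out a quarter of the samples to evaluate the fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination on held-out data

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("R^2 on test set:", r2)
```

A single feature explains only part of the variance, so a modest R^2 is expected here; the point is the fit/score workflow, not the score itself.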
1.2 Classification
from sklearn import datasets
iris = datasets.load_iris()
x = iris.data
y = iris.target
features = iris.feature_names
iris_data = pd.DataFrame(x, columns=features)
iris_data['target'] = y
iris_data.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
marker = ['s','x','o']
for index, c in enumerate(np.unique(y)):
    plt.scatter(x=iris_data.loc[y==c, "sepal length (cm)"], y=iris_data.loc[y==c, "sepal width (cm)"], alpha=0.8,
                label=c, marker=marker[c])
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.legend()
plt.show()
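The plot above shows the three iris classes in the plane of the two sepal features. As a minimal sketch of an actual classifier on those same two features, the following assumes `LogisticRegression` as the model (the notes do not name one) and an arbitrary 70/30 split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, :2]   # sepal length (cm), sepal width (cm)
y = iris.target        # discrete labels 0, 1, 2

# Stratify so each class keeps the same proportion in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Assumed model choice: multinomial logistic regression.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = clf.score(X_test, y_test)   # fraction of correctly classified test samples

print("classes:", clf.classes_)
print("test accuracy:", acc)
```

With only the sepal features, two of the classes overlap noticeably, so the accuracy stays well below what all four features would give.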
1.3 Unsupervised Learning
https://scikit-learn.org/stable/modules/classes.html?highlight=datasets#module-sklearn.datasets
# Generate a crescent-shaped (two moons), non-convex dataset
from sklearn import datasets
x, y = datasets.make_moons(n_samples=2000, shuffle=True,
                           noise=0.05, random_state=None)
for index, c in enumerate(np.unique(y)):
    plt.scatter(x[y==c, 0], x[y==c, 1], s=7)  # s: marker size
plt.show()
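The generated labels `y` are only used for coloring; an unsupervised method has to recover the two crescents from `x` alone. Here is a hedged sketch assuming a density-based method, `DBSCAN`, which the notes do not mention; the `eps` and `min_samples` values are guesses tuned to this noise level, and the seed is fixed only for reproducibility.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Same generator settings as above, but with a fixed seed (assumption).
x, y = make_moons(n_samples=2000, shuffle=True, noise=0.05, random_state=0)

# Assumed hyperparameters: eps is the neighborhood radius, min_samples the
# number of neighbors needed for a point to count as a core point.
db = DBSCAN(eps=0.15, min_samples=5).fit(x)
labels = db.labels_               # -1 marks points treated as noise

n_clusters = len(set(labels.tolist()) - {-1})
ari = adjusted_rand_score(y, labels)   # agreement with the generating labels

print("clusters found:", n_clusters)
print("adjusted Rand index:", ari)
```

The true labels are used here only to score the result, never to fit the model.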
# Generate clustered data drawn from normal (Gaussian) distributions
from sklearn import datasets
x, y = datasets.make_blobs(n_samples=5000, n_features=2, centers=3)
for index, c in enumerate(np.unique(y)):
    plt.scatter(x[y==c, 0], x[y==c, 1], s=7)
plt.show()
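Gaussian blobs like these are the typical setting for centroid-based clustering. The sketch below assumes `KMeans` as the method and, unlike the call above, passes explicit and well-separated cluster centers so the result is reproducible; those center coordinates are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Hypothetical, well-separated centers (assumption; the original uses centers=3,
# which places the centers at random).
centers = [(-5.0, -5.0), (0.0, 5.0), (5.0, -5.0)]
x, y = make_blobs(n_samples=5000, n_features=2, centers=centers,
                  cluster_std=1.0, random_state=0)

# Fit K-means without looking at y; n_clusters must be chosen up front.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(x)

ari = adjusted_rand_score(y, labels)   # y is used only to score the result

print("estimated centers:\n", km.cluster_centers_)
print("adjusted Rand index:", ari)
```

Cluster indices returned by K-means are arbitrary, which is why the comparison uses the adjusted Rand index rather than plain accuracy.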