- Import the data
- Clean the data
- Split the data into training and test sets
- Create a model
- Train the model
- Make predictions
- Evaluate and improve
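The cleaning step in the list above could look roughly like the following. This is only a sketch: the small `raw` table is made up for illustration, and dropping duplicates and incomplete rows is just one common way to clean data, not necessarily the right one for every dataset.

```python
import pandas as pd

# Made-up raw data: rows 0 and 1 are duplicates, row 3 has a missing age.
raw = pd.DataFrame({
    "age": [20, 20, 25, None, 31],
    "gender": [1, 1, 0, 0, 1],
    "genre": ["HipHop", "HipHop", "Dance", "Acoustic", "Classical"],
})

# Drop exact duplicate rows, then drop rows that contain any missing value.
clean = raw.drop_duplicates().dropna().reset_index(drop=True)
print(clean.shape)
```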
Download Anaconda
import pandas as pd
df = pd.read_csv('vgsales.csv')
df
import pandas as pd
df = pd.read_csv('vgsales.csv')
df.shape
df.describe()
import pandas as pd
music_data = pd.read_csv('music.csv')
music_data
age and gender form the input set
genre is the output set
import pandas as pd
# Decision tree classifier
from sklearn.tree import DecisionTreeClassifier
music_data = pd.read_csv('music.csv')
# X is the input set
X = music_data.drop(columns=['genre'])
# y is the output set
y = music_data['genre']
# Train the model
model = DecisionTreeClassifier()
model.fit(X,y)
# Predict what a 21-year-old man and a 22-year-old woman like
predictions = model.predict([[21,1],[22,0]])
predictions
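When the model is fitted on a DataFrame, scikit-learn 1.0 and later warns if predict receives a plain list without column names. Below is a minimal sketch of passing the new samples as a DataFrame instead. The inline table is a made-up stand-in for music.csv, and `new_people` is a hypothetical name.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for music.csv (gender: 1 = male, 0 = female).
music_data = pd.DataFrame({
    "age": [20, 25, 30, 35, 20, 25, 30, 35],
    "gender": [1, 1, 1, 1, 0, 0, 0, 0],
    "genre": ["HipHop", "HipHop", "Jazz", "Classical",
              "Dance", "Dance", "Acoustic", "Classical"],
})
X = music_data.drop(columns=["genre"])
y = music_data["genre"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# The new samples carry the same column names the model was trained with.
new_people = pd.DataFrame({"age": [21, 22], "gender": [1, 0]})
predictions = model.predict(new_people)
print(predictions)
```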
Measuring accuracy
import pandas as pd
# Decision tree classifier
from sklearn.tree import DecisionTreeClassifier
# Utility for splitting the data
from sklearn.model_selection import train_test_split
# Accuracy score
from sklearn.metrics import accuracy_score
music_data = pd.read_csv('music.csv')
# X is the input set: every column except genre
X = music_data.drop(columns=['genre'])
y = music_data['genre']
# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
# Train the model
model = DecisionTreeClassifier()
model.fit(X_train,y_train)
# Predict the genres for the 20% of inputs held out for testing
predictions = model.predict(X_test)
score = accuracy_score(y_test,predictions)
score
Running this several times gives different results: the train/test split is random on every run, so the score varies.
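If a repeatable score is needed, one option is to fix the `random_state` of both the split and the classifier. This is a sketch under assumptions: the inline table is a made-up stand-in for music.csv, and `run_once` is a hypothetical helper.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for music.csv.
music_data = pd.DataFrame({
    "age": [20, 23, 26, 29, 31, 35, 20, 22, 26, 28, 31, 34],
    "gender": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "genre": ["HipHop", "HipHop", "Jazz", "Jazz", "Classical", "Classical",
              "Dance", "Dance", "Acoustic", "Acoustic", "Classical", "Classical"],
})
X = music_data.drop(columns=["genre"])
y = music_data["genre"]

def run_once(seed):
    """Split, train, and score with both sources of randomness seeded."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

first = run_once(42)
second = run_once(42)
# Same seed, same split, same tree: the two scores are identical.
print(first == second)
```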
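In general, the more data the model is trained on, the higher its accuracy tends to be.
For example, here I raised the test share to 0.8, which leaves only 20% of the data for training,
and got a very poor result: artificial stupidity, in short.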
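Because a single split is noisy, one way to compare the two settings is to average the score over many random splits. The sketch below does that for test shares of 0.2 and 0.8. The inline table is a made-up stand-in for music.csv, and `mean_accuracy` and the choice of 50 runs are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for music.csv.
music_data = pd.DataFrame({
    "age": [20, 23, 26, 29, 31, 35, 20, 22, 26, 28, 31, 34],
    "gender": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "genre": ["HipHop", "HipHop", "Jazz", "Jazz", "Classical", "Classical",
              "Dance", "Dance", "Acoustic", "Acoustic", "Classical", "Classical"],
})
X = music_data.drop(columns=["genre"])
y = music_data["genre"]

def mean_accuracy(test_size, n_runs=50):
    """Average accuracy over n_runs different seeded train/test splits."""
    scores = []
    for seed in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = DecisionTreeClassifier(random_state=seed)
        model.fit(X_train, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    return sum(scores) / len(scores)

# Compare a small test share with a large one.
results = {size: mean_accuracy(size) for size in (0.2, 0.8)}
for size, score in results.items():
    print(size, round(score, 2))
```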
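Persisting the trained model: retraining every time we need to judge a user's taste wastes too much time, since a real environment can have tens of millions of records. Instead, save the trained model once and load it directly.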
import pandas as pd
# Decision tree classifier
from sklearn.tree import DecisionTreeClassifier
# Persistence utility (sklearn.externals.joblib was removed in scikit-learn 0.23)
import joblib
# music_data = pd.read_csv('music.csv')
# # X is the input set: every column except genre
# X = music_data.drop(columns=['genre'])
# y = music_data['genre']
# # Split the data: 80% for training, 20% for testing
# X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
# # Train the model
# model = DecisionTreeClassifier()
# model.fit(X_train,y_train)
# Persist the trained model
# joblib.dump(model,'music-recommender.joblib')
# Load the trained model
model = joblib.load('music-recommender.joblib')
# Predict the result
predictions = model.predict([[21,1]])
predictions
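The save-then-load round trip can also be checked in one self-contained run. This is a sketch using the standalone `joblib` package. The inline table is a made-up stand-in for music.csv, and the temporary directory is an assumption so the example leaves no files behind.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for music.csv.
music_data = pd.DataFrame({
    "age": [20, 25, 30, 35, 20, 25, 30, 35],
    "gender": [1, 1, 1, 1, 0, 0, 0, 0],
    "genre": ["HipHop", "HipHop", "Jazz", "Classical",
              "Dance", "Dance", "Acoustic", "Classical"],
})
X = music_data.drop(columns=["genre"])
y = music_data["genre"]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "music-recommender.joblib")
    joblib.dump(model, path)    # train once, save to disk
    loaded = joblib.load(path)  # later: load without retraining

# The loaded model should predict exactly what the original predicts.
sample = pd.DataFrame({"age": [21], "gender": [1]})
same = bool((loaded.predict(sample) == model.predict(sample)).all())
print(same)
```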
Visualizing the model
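The notes stop at this heading, so what follows is only one possible approach, not necessarily the one intended: scikit-learn's `export_graphviz` can turn a fitted tree into Graphviz DOT source, which a DOT viewer can render as a diagram. The inline table is a made-up stand-in for music.csv.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Made-up stand-in for music.csv.
music_data = pd.DataFrame({
    "age": [20, 25, 30, 35, 20, 25, 30, 35],
    "gender": [1, 1, 1, 1, 0, 0, 0, 0],
    "genre": ["HipHop", "HipHop", "Jazz", "Classical",
              "Dance", "Dance", "Acoustic", "Classical"],
})
X = music_data.drop(columns=["genre"])
y = music_data["genre"]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# out_file=None returns the DOT source as a string instead of writing a file.
dot_source = export_graphviz(
    model,
    out_file=None,
    feature_names=["age", "gender"],
    class_names=list(model.classes_),
    label="all",
    rounded=True,
    filled=True,
)
print(dot_source[:200])
```

Passing a file name such as 'music-recommender.dot' as `out_file` writes the same DOT source to disk instead of returning it.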