Algorithm Engineer 9: Machine Learning Overview (Part 2, Advanced Algorithms)

Latest recommended article published 2024-08-04 16:04:20

晓码bigdata

Category: Computer Vision Algorithm Engineer. Tags: python, machine learning

Article link: https://blog.csdn.net/xiaotiig/article/details/115034277

Algorithms

9 Naive Bayes
10 Support vector machines
11 EM algorithm
12 Markov chains
- 12.1 Hidden Markov models
13 Advanced ensemble learning
- XGBoost
- LightGBM
Must-know code

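9 Naive Bayes

Naive Bayes classifies using probabilities and generally assumes that features are unrelated to one another, that is, conditionally independent given the class.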
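To make the conditional-independence assumption concrete, here is a minimal sketch of a multinomial naive Bayes classifier with Laplace smoothing. The toy documents, the labels, and the helper names `train_nb` and `predict_nb` are made up for illustration and do not come from the original post.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Collect class counts and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha, "n_docs": len(docs)}


def predict_nb(model, tokens):
    """Return the class with the highest log posterior for the given tokens."""
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_label in model["class_counts"].items():
        # log prior P(class)
        score = math.log(n_label / model["n_docs"])
        total = sum(model["word_counts"][label].values())
        for tok in tokens:
            if tok not in model["vocab"]:
                continue  # ignore words never seen in training
            count = model["word_counts"][label][tok]
            # Independence assumption: per-word log likelihoods simply add up
            score += math.log((count + model["alpha"]) / (total + model["alpha"] * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus
docs = [["good", "great", "book"], ["great", "story"],
        ["bad", "boring", "book"], ["boring", "waste"]]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)
print(predict_nb(model, ["great", "book"]))
print(predict_nb(model, ["boring", "book"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never appeared in a class from forcing that class's probability to zero.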

10 Support vector machines

You must be able to derive the SVM yourself.
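As a reminder of what that derivation covers, here is an outline for the hard-margin linear SVM. It is a sketch under the usual textbook assumptions (linearly separable data, labels $y_i \in \{-1, +1\}$); the notation is chosen here and does not come from the original post's figures.

```latex
% Primal problem: maximize the margin 2/||w||
\min_{w,\,b}\ \frac{1}{2}\|w\|^2
\quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1,\quad i = 1,\dots,n

% Lagrangian with multipliers \alpha_i \ge 0
L(w, b, \alpha) = \frac{1}{2}\|w\|^2
  - \sum_{i=1}^{n} \alpha_i \left[ y_i\,(w^\top x_i + b) - 1 \right]

% Stationarity conditions
\frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{n} \alpha_i y_i = 0

% Dual problem, obtained by substituting the conditions back into L
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j
\quad \text{s.t.} \quad \alpha_i \ge 0,\quad \sum_{i=1}^{n} \alpha_i y_i = 0
```

Only the points with $\alpha_i > 0$ (the support vectors) contribute to $w$. Because the dual depends on the data only through inner products $x_i^\top x_j$, those can be replaced by a kernel function.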

11 EM algorithm

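12 Markov chains

A Markov chain is memoryless: the next state depends only on the current one. For example, if today is sunny, the probability of rain tomorrow depends only on today's weather and not on yesterday's.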
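The weather example can be sketched as a two-state chain. The transition probabilities below are invented for illustration; the point is only that tomorrow's distribution is computed from today's distribution and nothing earlier.

```python
# Hypothetical two-state weather chain; the probabilities are made up.
STATES = ["sunny", "rainy"]
P = [
    [0.8, 0.2],  # from sunny: P(sunny tomorrow), P(rainy tomorrow)
    [0.4, 0.6],  # from rainy
]


def step(dist, trans):
    """Advance a state distribution by one day: new[j] = sum_i dist[i] * trans[i][j]."""
    n = len(trans)
    return [sum(dist[i] * trans[i][j] for i in range(n)) for j in range(n)]


def n_step(dist, trans, days):
    """Apply the one-day update repeatedly."""
    for _ in range(days):
        dist = step(dist, trans)
    return dist


today = [1.0, 0.0]               # it is sunny today
tomorrow = step(today, P)        # depends only on today
day_after = n_step(today, P, 2)  # two days ahead
print(dict(zip(STATES, tomorrow)))
print(dict(zip(STATES, day_after)))
```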

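12.1 Hidden Markov models

A hidden Markov model (HMM) is a statistical model that describes a Markov process with hidden, unknown parameters.
The difficulty lies in determining the hidden parameters of the process from the observable ones. Those parameters are then used for further analysis, such as pattern recognition.

The key is to know a few basic concepts; with those, the model becomes easy to understand.

[Figure: concept figure]

The first problem is to find item 5 in the concept figure.
The second problem is to find the probability of the result of item 1 in the concept figure.
The third problem is to find item 4 in the concept figure, the hidden state sequence.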
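One of the classic HMM problems is computing the probability of an observation sequence under a known model, usually done with the forward algorithm. The sketch below is a plain-Python version. It reuses the box-and-ball parameters from the hmmlearn example in the code section of this post; the function name `forward` is chosen here for illustration.

```python
def forward(start, trans, emit, obs):
    """Forward algorithm: return P(obs | model) for a discrete HMM."""
    n = len(start)
    # alpha[i] = P(o_1, state_1 = i)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        # alpha_new[j] = (sum_i alpha[i] * a_ij) * b_j(o)
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
                 for j in range(n)]
    return sum(alpha)


start = [0.2, 0.4, 0.4]   # initial distribution over box 1, box 2, box 3
trans = [[0.5, 0.2, 0.3],
         [0.3, 0.5, 0.2],
         [0.2, 0.3, 0.5]]
emit = [[0.5, 0.5],       # P(red), P(white) for each box
        [0.4, 0.6],
        [0.7, 0.3]]
obs = [0, 1, 0]           # red, white, red

prob = forward(start, trans, emit, obs)
print(prob)
```

The result is about 0.130218, which agrees with the value 0.13022 quoted in the hmmlearn example, where `score` returns its natural logarithm.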

13 Advanced ensemble learning

XGBoost

LightGBM

https://zhuanlan.zhihu.com/p/61842339

https://blog.csdn.net/u010366748/article/details/113816465

Must-know code

11 Naive Bayes text sentiment classification

import pandas as pd
import numpy as np
import jieba
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

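# 1 Load the data

data = pd.read_csv("./data/书籍评价.csv", encoding="gbk")
print(data)

# 2.1) Take the review-text column ("内容") for analysis
content = data["内容"]
content.head()

# 2.2) Encode the rating column ("评价"): 1 = positive ("好评"), 0 = negative ("差评")

# Note: the numeric codes are not used below; the model is trained on the string labels directly
data.loc[data.loc[:, '评价'] == "好评", "评论标号"] = 1  # positive review -> 1
data.loc[data.loc[:, '评价'] == '差评', '评论标号'] = 0  # negative review -> 0

# data.head()
good_or_bad = data['评价'].values  # labels as a NumPy array
print(good_or_bad)
# ['好评' '好评' '好评' '好评' '差评' '差评' '差评' '差评' '差评' '好评' '差评' '差评' '差评']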

# 2.3) Load the stop words
stopwords=[]
with open('./data/stopwords.txt','r',encoding='utf-8') as f:
    lines=f.readlines()
    print(lines)
    for tmp in lines:
        line=tmp.strip()  # strip() removes leading/trailing whitespace, including the newline
        print(line)
        stopwords.append(line)
# stopwords  # inspect the resulting list

# Deduplicate the stop-word list
stopwords=list(set(stopwords))  # convert back to a list after deduplication
print(stopwords)

# 2.4) Tokenize the review text into a standard format
comment_list = []
for tmp in content:
    print(tmp)
    # Segment the text into words
    # cut_all defaults to False, so cut() uses accurate mode by default
    seg_list = jieba.cut(tmp, cut_all=False)
    print(seg_list)  # <generator object Tokenizer.cut at 0x0000000007CF7DB0>
    seg_str = ','.join(seg_list)  # join the tokens into one comma-separated string
    print(seg_str)
    comment_list.append(seg_str)  # collect the tokenized reviews in a list
# print(comment_list)  # inspect comment_list

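# 2.5) Count word occurrences
# CountVectorizer turns the tokenized text into a term-count matrix
con = CountVectorizer(stop_words=stopwords)
# fit_transform learns the vocabulary and counts how often each term occurs
X = con.fit_transform(comment_list)
# get_feature_names_out() returns the vocabulary terms
# (it replaces get_feature_names(), which was removed in scikit-learn 1.2)
name = con.get_feature_names_out()
print(X.toarray())  # toarray() shows the term-count matrix as a dense array
print(name)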

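# 2.6) Prepare the training and test sets
# The first 10 reviews form the training set and the last 3 the test set

# fit_transform returns a sparse matrix; toarray() converts it to a dense ndarray
# (MultinomialNB also accepts the sparse matrix directly)
x_train = X.toarray()[:10, :]
y_train = good_or_bad[:10]
# Test set
x_test = X.toarray()[10:, :]
y_test = good_or_bad[10:]

# Build the naive Bayes classifier
mb = MultinomialNB(alpha=1)  # alpha is the Laplace/Lidstone smoothing parameter (optional, default 1.0)
# Train
mb.fit(x_train, y_train)
# Predict
y_predict = mb.predict(x_test)
# Show the predictions next to the true labels
print('Predicted:', y_predict)
print('Actual:', y_test)

score = mb.score(x_test, y_test)
print(score)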
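To show what the term-count step in the script above produces, here is a simplified bag-of-words sketch in plain Python. It is not scikit-learn's implementation: `count_matrix` is a hypothetical helper, the sample strings are invented, and unlike `CountVectorizer` with default settings it neither lowercases text nor drops single-character tokens.

```python
from collections import Counter


def count_matrix(docs, stop_words=()):
    """Build a sorted vocabulary and one row of term counts per document."""
    stop = set(stop_words)
    # Each document is a comma-separated token string, like seg_str in the script above
    tokenized = [[t for t in doc.split(",") if t and t not in stop] for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(tokens).items():
            row[index[term]] = count
        rows.append(row)
    return vocab, rows


# Hypothetical tokenized reviews
vocab, rows = count_matrix(["good,book,good", "bad,book,the"], stop_words=["the"])
print(vocab)
print(rows)
```

Each row has one column per vocabulary term, which is the shape of input `MultinomialNB` expects.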

12 Handwritten digit recognition with an SVM

# Handwritten digit recognition with an SVM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split
import time
from sklearn.decomposition import PCA


# 1 Load the data
train = pd.read_csv(r"H:\05学习资料\14，软件开发\黑马人工智能\2课件\阶段3-人工智能机器学习\阶段3-人工智能机器学习\02_机器学习算法day10\02_机器学习算法day10\02-代码\data\train.csv")
print(train)

# 1.1 Separate the features and the target
train_image = train.iloc[:,1:]
print(train_image)

train_label = train.iloc[:,0]
print("Labels:")
print(train_label)

# 1.2 Inspect the pixel values of one sample
print(train_image.iloc[2,:])

# 1.3 Display one sample as an image
def to_plot(n):
    values = train_image.iloc[n,].values
    values = values.reshape(28, 28)
    plt.imshow(values)
    plt.axis("off")
    plt.show()

# to_plot(9)

# 1.4 Scale pixel values to [0, 1]
train_image = train_image.values/255

# 1.5 Split into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(train_image,train_label,test_size=0.2,random_state=0)
print(x_train.shape,x_val.shape)

# Run PCA with several settings to find a good value of n_components
def n_components_analysis(n,x_train,y_train,x_val,y_val):
    # Record the start time
    start = time.time()

    # PCA dimensionality reduction
    pca = PCA(n_components=n)
    print("PCA n_components: {}".format(n))
    pca.fit(x_train)

    # Project both the training set and the validation set
    x_train_pca = pca.transform(x_train)
    x_val_pca = pca.transform(x_val)
    print("Training the SVC:")
    ss = svm.SVC()
    ss.fit(x_train_pca,y_train)

    # Accuracy on the validation set
    accuracy = ss.score(x_val_pca,y_val)
    print("Accuracy:", accuracy)
    # Record the elapsed time
    end = time.time()
    use_time = end-start
    print("Elapsed time (s):",use_time)
    return accuracy

n_s = np.linspace(0.7,0.85,num=5)
accuracy = []
for n in n_s:
    tem = n_components_analysis(n,x_train,y_train,x_val,y_val)
    accuracy.append(tem)

plt.plot(n_s,np.array(accuracy),"r")
plt.show()



# 4 Fit the chosen model
pca = PCA(n_components=0.8)
pca.fit(x_train)
print(pca.n_components_)

x_train_pca = pca.transform(x_train)
x_val_pca = pca.transform(x_val)

# Train the better-performing model
ss1 = svm.SVC()
ss1.fit(x_train_pca,y_train)
score = ss1.score(x_val_pca,y_val)
print("Final accuracy:",score)
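In the script above, `PCA(n_components=0.8)` receives a fraction rather than a count. scikit-learn then keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below illustrates this selection rule only; `components_for_variance` and the variance values are hypothetical, and it does not perform PCA itself.

```python
def components_for_variance(variances, threshold):
    """Return the smallest k whose top-k variances explain more than `threshold` of the total."""
    total = sum(variances)
    running = 0.0
    for k, var in enumerate(sorted(variances, reverse=True), start=1):
        running += var
        if running / total > threshold:
            return k
    return len(variances)


# Hypothetical per-component variances (eigenvalues of the covariance matrix)
variances = [4.0, 3.0, 2.0, 1.0]
k_80 = components_for_variance(variances, 0.8)  # cumulative ratios: 0.4, 0.7, 0.9, 1.0
k_50 = components_for_variance(variances, 0.5)
print(k_80, k_50)
```

Sweeping the fraction, as the loop over `np.linspace(0.7, 0.85, num=5)` does, trades accuracy against training time, since fewer components make the SVC cheaper to fit.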

13 A simple hidden Markov model implementation

# Hidden Markov model
import numpy as np
from hmmlearn import hmm
import math


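# Hidden states
states = ["box 1", "box 2", "box 3"]
n_states = len(states)

# Observable symbols
observations = ["red", "white"]
n_observations = len(observations)

# Initial state distribution
start_probability = np.array([0.2, 0.4, 0.4])

# State transition probability matrix
transition_probability = np.array([
  [0.5, 0.2, 0.3],
  [0.3, 0.5, 0.2],
  [0.2, 0.3, 0.5]
])

# Emission (observation) probability matrix
emission_probability = np.array([
  [0.5, 0.5],
  [0.4, 0.6],
  [0.7, 0.3]
])

# Set the model parameters
# Note: recent hmmlearn releases model discrete symbols with CategoricalHMM;
# older releases provided the same behavior under the name MultinomialHMM
model = hmm.CategoricalHMM(n_components=n_states)
model.startprob_=start_probability  # initial state distribution
model.transmat_=transition_probability  # state transition probability matrix
model.emissionprob_=emission_probability  # emission probability matrix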

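# Decode the most likely hidden state sequence
seen = np.array([[0,1,0]]).T  # observation sequence: red, white, red
box = model.predict(seen)

print("Observed ball colors:\n", ", ".join(map(lambda x: observations[x], seen.flatten())))
print("Check whether seen itself was changed:",seen)
# Note: flatten() is needed to turn seen from a 2-D array into a 1-D one
print("Most likely hidden state sequence:\n", ", ".join(map(lambda x: states[x], box)))
print(box)

# Next, HMM problem one: the probability of the observation sequence
print(model.score(seen))
# Output: -2.03854530992

# score() returns the natural-log probability; the hand calculation for HMM problem one gives the raw probability 0.13022. Compare:
print(math.exp(-2.038545309915233))
# ln 0.13022 ≈ -2.0385
# Output: 0.13021800000000003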
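The decoding step can also be written out by hand. hmmlearn's `predict` uses the Viterbi algorithm by default, and the sketch below is a plain-Python version of it for the same box-and-ball parameters. The function name `viterbi` and the list-based representation are choices made here for illustration.

```python
def viterbi(start, trans, emit, obs):
    """Return the most likely hidden state path and its probability."""
    n = len(start)
    # delta[i] = probability of the best path that ends in state i
    delta = [start[i] * emit[i][obs[0]] for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = delta
        pointers, delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] * trans[i][j])
            pointers.append(best_i)
            delta.append(prev[best_i] * trans[best_i][j] * emit[j][o])
        back.append(pointers)
    # Backtrack from the best final state
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, max(delta)


start = [0.2, 0.4, 0.4]
trans = [[0.5, 0.2, 0.3],
         [0.3, 0.5, 0.2],
         [0.2, 0.3, 0.5]]
emit = [[0.5, 0.5],
        [0.4, 0.6],
        [0.7, 0.3]]
obs = [0, 1, 0]  # red, white, red

path, best_prob = viterbi(start, trans, emit, obs)
print(path, best_prob)
```

With zero-based state indices, the path `[2, 2, 2]` means the third box is the most likely source at every step, with path probability about 0.0147.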

14 LightGBM: PUBG survival probability (an excellent case study that pulls many topics together; must know)

# Predict PUBG player placement with LightGBM
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import lightgbm as lgb


# 1 Load and preprocess the data
train = pd.read_csv(r"H:\05学习资料\14，软件开发\黑马人工智能\2课件\阶段3-人工智能机器学习\阶段3-人工智能机器学习\02_机器学习算法day13\02_机器学习算法day13\02-代码\data\train_V2.csv")
# 1.1 Inspect the data
print("Data overview:")
print(train.shape)
print(train.size)
print(train.head())
print(train.tail())
#print(train.describe())
print(train.info())


print("Number of matches:")
game_count = np.unique(train["matchId"]).shape          # distinct match IDs (groupId would count teams instead)
print(game_count)

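# 1.2 Handle missing values
queshi = np.any(train.isnull())
# True means there are missing values; here they are in the winPlacePerc column
print(queshi)

# Select the rows whose winPlacePerc is missing (boolean-mask indexing)
queshi_row = train[train["winPlacePerc"].isnull()]
print("Rows with missing values:")
print(queshi_row)

# Drop those rows by index (in this dataset it is the single row 2744604)
train.drop(queshi_row.index, inplace=True)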

# Number of players in each match
print("Players per match:")
count = train.groupby("matchId")["matchId"].transform('count')

train['playersJoined'] = count
print(count.count())

print("Data so far:\n",train)

# Sort by players per match, ascending
paixu = train["playersJoined"].sort_values()
print("Sorted:")
print(train)
print(paixu)

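# Plot players per match: first all matches, then only matches with at least 75 players
# (no rows are dropped here; the filter applies to the second plot only)
plt.figure(figsize=(20,8))
sns.countplot(x=train["playersJoined"])
plt.grid()
#plt.show()


plt.figure(figsize=(20,8))
sns.countplot(x=train[train["playersJoined"]>=75]["playersJoined"])
plt.grid()
#plt.show()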

# Normalize match-size-dependent features by the number of players in the match
train["killsNorm"] = train["kills"]*((100-train["playersJoined"])/100+1)
train["damageDealtNorm"] = train["damageDealt"]*((100-train["playersJoined"])/100+1)
train["maxPlaceNorm"] = train["maxPlace"]*((100-train["playersJoined"])/100+1)
train["matchDurationNorm"] = train["matchDuration"]*((100-train["playersJoined"])/100+1)

print(train.head())

# Combine related features
train["healsandboosts"] = train["heals"] + train["boosts"]
print(train[["heals","boosts","healsandboosts"]].tail())

# Note the double brackets above: selecting several columns needs a list; the line below raises a KeyError
#print(train["heals","boosts","healsandboosts"].tail())


# Remove players who have kills but never moved
train["totalDistance"] = train["rideDistance"] + train["walkDistance"] + train["swimDistance"]
print("With total distance added:")
print(train.head())

train["killwithoutMoving"] = (train["kills"]>0) & (train["totalDistance"] == 0)
print(train[train["killwithoutMoving"] == True])

print("Number of players with kills but no movement:")
print(train[train["killwithoutMoving"] == True].shape)

# Drop these rows
train.drop(train[train["killwithoutMoving"] == True].index, inplace=True)
print("After removal:")
print(train.shape)



# Remove rows with an abnormal number of road kills
train.drop(train[train["roadKills"] > 10].index, inplace=True)
print("After removal:")
print(train.shape)

# Remove players with more than 30 kills in one match
print(train[train["kills"] > 30])
train.drop(train[train["kills"] > 30].index, inplace=True)
print("After removal:")
print(train.shape)

# Remove rows with an abnormal headshot rate
train["headshot_rate"] = train["headshotKills"] / train["kills"]
train["headshot_rate"] = train["headshot_rate"].fillna(0)  # 0 kills gives NaN; treat it as a rate of 0
print(train["headshot_rate"].tail())

plt.figure(figsize=(12,4))
sns.histplot(train["headshot_rate"], bins=10, kde=False)  # histplot replaces the deprecated distplot
#plt.show()

train.drop(train[(train["headshot_rate"] ==1) & (train["kills"] > 9)].index, inplace=True)
print("After removal:")
print(train.shape)

# Remove rows with a longest kill of 1000 or more
train.drop(train[train["longestKill"] >= 1000].index, inplace=True)
print("After removing longest kills of 1000 or more:")
print(train.shape)

# Remove rows with abnormal travel distances
train.drop(train[train["walkDistance"] >= 10000].index, inplace=True)
train.drop(train[train["rideDistance"] >= 20000].index, inplace=True)
train.drop(train[train["swimDistance"] >= 2000].index, inplace=True)

# Remove rows with an abnormal number of weapons acquired
# Compare the two selections below: [["col"]] returns a DataFrame, ["col"] returns a Series
print(train[train["weaponsAcquired"] >= 80][["weaponsAcquired"]].head())
print("Without the extra []:")
print(train[train["weaponsAcquired"] >= 80]["weaponsAcquired"].head())

train.drop(train[train["weaponsAcquired"] >= 80].index, inplace=True)

# Remove rows with an abnormal number of healing items used
train.drop(train[train["heals"] >= 40].index, inplace=True)
print(train.shape)


############ Categorical features
print(train["matchType"].unique())
train = pd.get_dummies(train, columns=["matchType"])
print(train.head())

# Select the one-hot encoded columns
matchType_encoding = train.filter(regex="matchType")
print(matchType_encoding.head())

# Convert groupId and matchId to numeric codes
# astype("category") turns the ID strings into a categorical column,
# and .cat.codes gives each distinct ID an integer code
train["groupId"] = train["groupId"].astype("category")
train["groupId_cat"] = train["groupId"].cat.codes
print("groupId after encoding:")
print(train[["groupId","groupId_cat"]])

train["matchId"] = train["matchId"].astype("category")
train["matchId_cat"] = train["matchId"].cat.codes
print("matchId after encoding:")
print(train[["matchId","matchId_cat"]])

# Shape of train at this point
######################## Compare with df = df_sample.drop(["winPlacePerc", "Id"], axis=1): inplace=True modifies train itself, while that call returns a new DataFrame
print("Shape of train now:")
print(train.shape)
# Drop groupId and matchId
train.drop(["groupId","matchId"], axis=1, inplace=True)
print(train.shape)


# Work with a random sample of the data
df_sample = train.sample(100000)
print(df_sample.shape)

######## Features and target

df = df_sample.drop(["winPlacePerc", "Id"], axis=1)
y = df_sample["winPlacePerc"]
print("df_sample is unchanged by drop() without inplace=True:")
print(df_sample.shape)

########### Train/test split
x_train, x_test, y_train, y_test = train_test_split(df,y,test_size=0.2)
print(x_train.shape,y_train.shape)

###   3 Model training
# 3.1 Random forest
ml = RandomForestRegressor(n_estimators=40,
                           min_samples_leaf=3,
                           max_features="sqrt",
                           n_jobs = -1)

# n_jobs=-1 uses all CPU cores for training; a positive integer sets the number of cores to use
ml.fit(x_train,y_train)

y_pre = ml.predict(x_test)
score = ml.score(x_test,y_test)  # for a regressor, score() is R^2, not accuracy
loss = mean_absolute_error(y_true=y_test,y_pred=y_pre)
print("R^2 score and mean absolute error:\n",score,loss)


print("Feature importances in the current model: ml.feature_importances_")

imp_df = pd.DataFrame({"cols":df.columns,"imp":ml.feature_importances_})
print(imp_df)

imp_df = imp_df.sort_values("imp", ascending=False)
print(imp_df)

# Plot the 20 most important features
imp_df[:20].plot("cols","imp",figsize=(20,8),kind = "barh")

# Keep the columns whose importance exceeds 0.005
to_keep = imp_df[imp_df.imp>0.005].cols

# 3.2 Rebuild the model using only the important features
df_keep = df[to_keep]
x_train,x_test,y_train,y_test = train_test_split(df_keep,y,test_size=0.2)
print(x_train.shape)

m2 = RandomForestRegressor(n_estimators=40,
                           min_samples_leaf=3,
                           max_features="sqrt",
                           n_jobs = -1)
m2.fit(x_train,y_train)
y_pre = m2.predict(x_test)
print("R^2 score and MAE of m2:")
print(m2.score(x_test,y_test))
print(mean_absolute_error(y_test,y_pre))


###### 3.3 Train a LightGBM model

x_train, x_test, y_train, y_test = train_test_split(df,y,test_size=0.2)
gbm = lgb.LGBMRegressor(objective="regression",
                        num_leaves=31,
                        learning_rate=0.05,
                        n_estimators=20)
# Early stopping is passed as a callback; the early_stopping_rounds argument of fit() was removed in LightGBM 4.0
gbm.fit(x_train,y_train,eval_set=[(x_test,y_test)],eval_metric="l1", callbacks=[lgb.early_stopping(stopping_rounds=5)])
y_pre = gbm.predict(x_test, num_iteration=gbm.best_iteration_)
print(mean_absolute_error(y_test, y_pre))

####### 3.4 Model tuning with grid search
estimator = lgb.LGBMRegressor(num_leaves=31)
param_grid = {
    "learning_rate":[0.01,0.1],
    "n_estimators":[40,60,80]
}

# GridSearchCV evaluates every parameter combination with 5-fold cross-validation and keeps the best one; it is worth studying in detail
gbm = GridSearchCV(estimator,param_grid, cv=5, n_jobs=-1)
gbm.fit(x_train, y_train)

y_pre = gbm.predict(x_test)
print(mean_absolute_error(y_test, y_pre))

print("Best parameters:")
print(gbm.best_params_)

#### Second tuning method: loop over n_estimators and compare the MAE
errors = []
n_estimators = [60,80,100,]

for nes in n_estimators:
    lgbm = lgb.LGBMRegressor(objective="regression",
                        boosting_type="gbdt",
                        num_leaves=31,
                        learning_rate=0.1,
                        n_estimators=nes,
                        min_child_samples=20,
                        n_jobs=-1,
                        max_depth=5
                        )
    # Early stopping is passed as a callback; the early_stopping_rounds argument of fit() was removed in LightGBM 4.0
    lgbm.fit(x_train, y_train, eval_set=[(x_test, y_test)], eval_metric="l1", callbacks=[lgb.early_stopping(stopping_rounds=5)])
    # Predict with the model fitted in this iteration (lgbm), not the earlier grid-search object (gbm)
    y_pre = lgbm.predict(x_test, num_iteration=lgbm.best_iteration_)
    mae = mean_absolute_error(y_test, y_pre)
    errors.append(mae)
    print("MAE for this run:",mae)

plt.plot(n_estimators,errors,"o-")
plt.ylabel("mae")
plt.xlabel("n_estimators")
print("best n_estimators {}".format(n_estimators[np.argmin(errors)]))
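The normalization step in the script above multiplies each feature by `(100 - playersJoined) / 100 + 1`, so statistics from smaller matches are scaled up relative to a full 100-player match. Here is a small worked sketch of that factor; `scale_by_players` and the sample numbers are hypothetical, and treating 100 as the full lobby size is taken from the formula in the script.

```python
def scale_by_players(value, players_joined, full_lobby=100):
    """Scale a per-match statistic by how far the match is from a full lobby."""
    factor = (full_lobby - players_joined) / full_lobby + 1
    return value * factor


# 5 kills in a 90-player match are weighted by 1.1; in a full match by 1.0
kills_small_match = scale_by_players(5, 90)
kills_full_match = scale_by_players(5, 100)
print(kills_small_match, kills_full_match)
```

The factor stays between 1 and 2 for any match with 0 to 100 players, so it adjusts the features without changing their order of magnitude.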
