Data Preparation and Feature Engineering (Guiyang, Guizhou): Feature Engineering

Latest recommended article published on 2024-07-27 17:06:15

权馍馍程序员

Article tags: artificial intelligence, python, programming languages

Article link: https://blog.csdn.net/m0_64745373/article/details/135317151

① About the course textbook:

Free ChatGPT: Ai Loading https://vip7.1ai.ink/chat

1. Version 3.5 can be used indefinitely.

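② Requirements for the school's feature engineering project:

Data understanding and cleaning (30 points):
    Data quality assessment and data cleaning methods (10 points)
    Handling missing values and outliers (10 points)
    Basic statistical analysis and visualization of the data (10 points)
Feature selection (20 points)
    Analysis of variance and feature filtering (10 points)
    Feature importance evaluation and selection (10 points)
Feature dimensionality reduction (10 points)
    Dimensionality reduction with PCA or LDA (10 points)
Model performance (20 points)
    Evaluating how much feature engineering improves model performance (10 points)
    Cross-validation and model validation (10 points)
Documentation and defense (20 points)
    Clarity and completeness of the defense document (10 points)
    Explanation and justification of the feature engineering methods (10 points)
Each student works alone; teams are not allowed.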

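③ Overview of my feature engineering:

1. Plot the distribution of the newspaper column

2. Inspect the outliers in the TV column:

3. Plot the data after dimensionality reduction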

④ Code walkthrough: there are two archives

📎Untitled Folder(1).zip

📎期末大作业.zip (the final course project)

1. Code in the Untitled Folder archive:

import pandas as pd

df = pd.read_excel('./广告收益数据.xlsx')

# Data quality assessment
print(df.head())  # first few rows of the data set
print(df.info())  # column dtypes and non-null counts

# Data understanding and cleaning
print(df.duplicated().sum())      # number of duplicate rows
print(df.isna().sum())            # missing values per column
print(df.isna().sum() / len(df))  # share of missing values per column

# Drop rows with missing values, then drop duplicate rows
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

# Convert the revenue column ('收益') to integers
df['收益'] = df['收益'].astype(int)
# Save the cleaned data
df.to_excel('clean_df.xlsx', index=False)
# Split into feature variables and the target variable (revenue)
X = df.drop(columns=['收益'])
y = df['收益']
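import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of the pairwise correlations between columns
plt.figure(figsize=(8, 6))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.5f')
plt.title('Correlation Heatmap')
plt.show()

# 1. Plot the distribution of the newspaper column ('报纸')
df['报纸'].hist()
plt.show()
# Box plot to inspect outliers in the TV column ('电视')
sns.boxplot(data=df['电视'])
plt.show()
# Box plot to inspect outliers in the newspaper column ('报纸')
sns.boxplot(data=df['报纸'])
plt.show()
# Box plot to inspect outliers in the radio column ('广播')
sns.boxplot(data=df['广播'])
plt.show()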
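# Flag outliers in the newspaper column with the 1.5 * IQR rule
result = df.quantile([0.25, 0.75], axis=0)
IQR = result.iloc[1]['报纸'] - result.iloc[0]['报纸']
top = result.iloc[1]['报纸'] + 1.5 * IQR     # upper fence: Q3 + 1.5 * IQR
bottom = result.iloc[0]['报纸'] - 1.5 * IQR  # lower fence: Q1 - 1.5 * IQR
print(top, bottom)
# Rows whose newspaper value falls outside the fences
df.loc[(df['报纸'] > top) | (df['报纸'] < bottom)]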
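The same 1.5 * IQR rule can be wrapped in a small helper so that it can be reused for the TV and radio columns as well. This is a minimal sketch: the function name `iqr_outliers`, the parameter `k`, and the demo values are assumptions for illustration and do not come from the original data set.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr  # lower fence
    upper = q3 + k * iqr  # upper fence
    return (series < lower) | (series > upper)


# Hypothetical newspaper values; 300.0 is planted as an obvious outlier.
demo = pd.DataFrame({"报纸": [10.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 300.0]})
mask = iqr_outliers(demo["报纸"])
print(demo[mask])      # rows flagged as outliers
cleaned = demo[~mask]  # rows kept after filtering
```

Whether flagged rows should be dropped, capped, or kept depends on the data; the mask only identifies candidates.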
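# Run a one-way ANOVA for every column of df except revenue ('收益') and print
# the result for each feature: feature name, F-statistic and p-value.
# Note: the rows are grouped by each distinct revenue value, so this is only
# meaningful when revenue takes a small number of distinct values.
from scipy.stats import f_oneway
groups = df['收益'].unique()

for feature in df.columns:
    if feature != '收益':  # skip the grouping column
        data_by_group = [df[feature][df['收益'] == group] for group in groups]
        f_statistic, p_value = f_oneway(*data_by_group)
        print(f"ANOVA for {feature}: F-statistic={f_statistic}, p-value={p_value}")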
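If revenue is a continuous value, grouping by every distinct revenue can leave groups with a single row, which makes the one-way ANOVA hard to interpret. One possible alternative, which is a different technique from the one used above, is the univariate F-test for regression in scikit-learn (`f_regression`). The sketch below runs on synthetic data; the column relationships and the random seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic stand-in for the advertising data (hypothetical values).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "电视": rng.uniform(0, 300, n),
    "广播": rng.uniform(0, 50, n),
    "报纸": rng.uniform(0, 100, n),
})
# Assumed relationship: revenue depends on TV and radio, not on newspaper.
demo["收益"] = 0.05 * demo["电视"] + 0.2 * demo["广播"] + rng.normal(0, 1, n)

features = ["电视", "广播", "报纸"]
# f_regression tests each feature separately for a linear relation with the target.
f_stats, p_values = f_regression(demo[features], demo["收益"])
for name, f_stat, p_val in zip(features, f_stats, p_values):
    print(f"{name}: F={f_stat:.2f}, p={p_val:.4g}")
```

A small p-value suggests a linear association between that feature and revenue; it does not capture non-linear effects or interactions between features.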
# Sequential (iterative) feature selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from mlxtend.feature_selection import SequentialFeatureSelector

# Encode the newspaper levels low/medium/high ('低'/'中'/'高') as 0/1/2 if present
df = df.replace({'报纸': {'低': 0, '中': 1, '高': 2}})
X = df.drop(columns='收益')
y = df['收益']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Baseline decision tree on all features
model = DecisionTreeClassifier(max_depth=3, random_state=123)
model.fit(X_train, y_train)
model.score(X_test, y_test)
sfs = SequentialFeatureSelector(estimator=DecisionTreeClassifier(max_depth=3, random_state=123),
            k_features=3, scoring='accuracy')
sfs.fit(X_train, y_train)
sfs.subsets_
# Repeat the experiment with only the TV, radio and newspaper columns
X_new = df[['电视', '广播', '报纸']]
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=123)
model = DecisionTreeClassifier(max_depth=3, random_state=123)
model.fit(X_train, y_train)
model.score(X_test, y_test)
sfs = SequentialFeatureSelector(estimator=DecisionTreeClassifier(max_depth=3, random_state=123),
            k_features=3, scoring='accuracy')
sfs.fit(X_train, y_train)
sfs.subsets_
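The selection above treats revenue as a class label and scores with accuracy. If revenue is treated as a continuous target instead, a regressor scored with R² may be a better fit. The sketch below shows that variant using scikit-learn's own `SequentialFeatureSelector` rather than the mlxtend one used above; the synthetic data, the seed, and the choice of two features are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the advertising data (hypothetical values).
rng = np.random.default_rng(1)
n = 300
X_demo = pd.DataFrame({
    "电视": rng.uniform(0, 300, n),
    "广播": rng.uniform(0, 50, n),
    "报纸": rng.uniform(0, 100, n),
})
# Assumed relationship: revenue depends on TV and radio, not on newspaper.
y_demo = 0.05 * X_demo["电视"] + 0.2 * X_demo["广播"] + rng.normal(0, 0.5, n)

# Forward selection: start with no features and add the one that most improves
# the cross-validated R^2 until two features have been chosen.
selector = SequentialFeatureSelector(
    DecisionTreeRegressor(max_depth=3, random_state=123),
    n_features_to_select=2,
    direction="forward",
    scoring="r2",
    cv=5,
)
selector.fit(X_demo, y_demo)
selected = list(X_demo.columns[selector.get_support()])
print(selected)
```

Note that scikit-learn's selector requires `n_features_to_select` to be smaller than the number of input columns, so asking for all three features here would raise an error.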
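# Fit a LightGBM regressor and compare its predictions with the actual values
from lightgbm import LGBMRegressor
model = LGBMRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
a = pd.DataFrame()  # create an empty DataFrame
a['预测值'] = list(y_pred)  # predicted values
a['实际值'] = list(y_test)  # actual values
a.head()
# Basic statistical analysis of the data
print(df.describe())  # descriptive statistics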
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeCV
# Split the data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print('X_train size', X_train.shape)
print('y_train size', y_train.shape)
print('X_test size', X_test.shape)
print('y_test size', y_test.shape)
# Initialise and fit the model
L = LinearRegression()
L.fit(X_train, y_train)
# R^2 score on the training set
L.score(X_train, y_train)
# R^2 score on the test set
L.score(X_test, y_test)
# Cross-validate the linear regression model
scores = cross_val_score(L, X, y, cv=6, scoring='r2')

print("Cross-Validation Scores:\n", scores)
print("Mean R^2 Score:", scores.mean())
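To judge whether a feature engineering step helps, the same cross-validation can be run on different feature sets and the mean scores compared. The sketch below is one way to do that on synthetic data; the helper name `mean_cv_r2`, the data, and the seeds are assumptions for illustration. It also passes a shuffled `KFold`, because an integer `cv` for a regressor splits the rows in their stored order without shuffling.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the advertising data (hypothetical values).
rng = np.random.default_rng(2)
n = 200
demo = pd.DataFrame({
    "电视": rng.uniform(0, 300, n),
    "广播": rng.uniform(0, 50, n),
    "报纸": rng.uniform(0, 100, n),
})
# Assumed relationship: revenue depends on TV and radio, not on newspaper.
demo["收益"] = 0.05 * demo["电视"] + 0.2 * demo["广播"] + rng.normal(0, 1, n)


def mean_cv_r2(columns):
    """Mean R^2 of a linear regression over 6 shuffled folds for the given columns."""
    cv = KFold(n_splits=6, shuffle=True, random_state=42)
    scores = cross_val_score(LinearRegression(), demo[columns], demo["收益"],
                             cv=cv, scoring="r2")
    return scores.mean()


all_features = mean_cv_r2(["电视", "广播", "报纸"])
selected_features = mean_cv_r2(["电视", "广播"])
newspaper_only = mean_cv_r2(["报纸"])
print(f"all features:      {all_features:.3f}")
print(f"selected features: {selected_features:.3f}")
print(f"newspaper only:    {newspaper_only:.3f}")
```

Using the same folds for every feature set keeps the comparison fair, since each candidate is scored on identical train and validation splits.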
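# Dimensionality reduction with PCA
# Note: this demo uses the Iris data set, not the advertising data,
# and it overwrites X and y.
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load the data set
iris = load_iris()
X = iris.data
y = iris.target

# Create the PCA object and set the number of dimensions to keep
pca = PCA(n_components=2)

# Fit PCA on the full data set and transform it
X_pca = pca.fit_transform(X)

# Plot the data after dimensionality reduction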
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of IRIS dataset')
plt.colorbar(label='Target')
plt.show()
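The PCA demo above runs on the Iris data. A comparable step for the advertising features might look like the sketch below, which standardizes the columns first so that a column with a larger numeric range does not dominate the components. The data here is synthetic and the column ranges are assumptions; with the real data, the feature columns of the cleaned DataFrame would take the place of `demo`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three advertising feature columns (hypothetical values).
rng = np.random.default_rng(3)
n = 200
demo = pd.DataFrame({
    "电视": rng.uniform(0, 300, n),
    "广播": rng.uniform(0, 50, n),
    "报纸": rng.uniform(0, 100, n),
})

# Standardize each column, then project onto the first two principal components.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_pca = pipeline.fit_transform(demo)

# Share of the total variance captured by each retained component.
ratios = pipeline.named_steps["pca"].explained_variance_ratio_
print(X_pca.shape)
print(ratios, ratios.sum())
```

The explained variance ratios indicate how much information the two components retain, which is one way to justify the chosen number of dimensions.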
