Machine Learning Assignment: Cosmetics Data Analysis

Article link: https://blog.csdn.net/qq_55970659/article/details/141299489

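1. Data Analysis Process and Models

This chapter analyzes the data for the WIS whitening sunscreen spray (WIS美白防晒霜喷). It applies K-means clustering, SVM classification, logistic regression classification, and random forest classification to reveal latent patterns in the data and the importance of individual features, and it explains what these results mean.

1.1 Data Preprocessing

To make the data suitable for analysis, we standardized it. Standardization removes differences in scale and units between features, which makes model training more stable and effective.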
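As a minimal sketch of what standardization does, the snippet below applies scikit-learn's `StandardScaler` to a small made-up table. The column names and values are illustrative assumptions and are not taken from the WIS dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
toy = pd.DataFrame({
    "price": [59.0, 79.0, 99.0, 129.0],
    "units_sold": [12000, 8000, 3000, 500],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)  # returns a NumPy array

# After scaling, each column has mean ~0 and (population) standard deviation ~1.
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```

Once both columns are on the same scale, a distance-based method such as K-means is no longer dominated by the feature with the largest raw values.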

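1.2 K-means Clustering Analysis

The K-means algorithm partitions the data into 5 clusters in order to identify distinct groups in the data. The clustering result is shown in the figure below:

Figure 1. K-means clustering plot

Each color in the figure marks one cluster. The result suggests that the data separates naturally into several groups, each of which may correspond to consumers or products with similar characteristics. This grouping helps us understand the internal structure of the data and supports a different marketing strategy for each group.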
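One way to turn cluster labels into something interpretable is to look at cluster sizes and per-cluster feature means. The sketch below does this on synthetic data; the three groups, the column names `price` and `units_sold`, and the choice of 3 clusters are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated groups of products.
centers = np.array([[50.0, 1000.0], [100.0, 5000.0], [200.0, 300.0]])
points = np.vstack(
    [c + rng.normal(scale=[3.0, 50.0], size=(30, 2)) for c in centers]
)
toy = pd.DataFrame(points, columns=["price", "units_sold"])

# Cluster on standardized features, as the article does.
scaled = StandardScaler().fit_transform(toy)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(scaled)

# Profile each cluster: how many rows it holds and its average feature values.
cluster_sizes = toy["cluster"].value_counts().sort_index()
cluster_profile = toy.groupby("cluster")[["price", "units_sold"]].mean()
print(cluster_sizes)
print(cluster_profile.round(1))
```

Reading the profile table in original units (rather than standardized values) is usually what makes a cluster describable, for example as "high price, low volume".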

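1.3 SVM Classification Analysis

We trained a support vector machine (SVM) model to predict the cluster labels and evaluated its performance.

Model training details:

Data split: the dataset was divided into a training set and a test set at a ratio of 70%:30%.
Standardization: the data was standardized with StandardScaler.
Model parameters: a linear kernel (kernel='linear') was used for training.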
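The three training details above can be combined into a single scikit-learn pipeline. The following is a sketch under stated assumptions, not the article's own code: the data comes from `make_blobs` as a stand-in for the real features and cluster labels, and `stratify=y` is an added choice that keeps every class present in both parts of the split.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for the feature matrix and cluster labels.
X, y = make_blobs(n_samples=240, centers=4, n_features=5, random_state=0)

# 70%/30% split; stratify keeps every class represented in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The scaler is fitted on the training part only and then reused on the test part.
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))
svm_pipeline.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, svm_pipeline.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")
```

Putting the scaler inside the pipeline means the test set never influences the scaling parameters, which keeps the reported accuracy an honest estimate.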

The evaluation results are shown below.

Table 1. Classification metrics for the SVM model

Cluster	precision	recall	f1-score	support
0	1.00	1.00	1.00	9
1	1.00	1.00	1.00	59
3	1.00	1.00	1.00	2
4	1.00	1.00	1.00	2

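The SVM model performs very well, reaching 100% accuracy on the test set. This indicates that the model separates the clusters cleanly and classifies each of them correctly. Cluster 2 is absent from the table, presumably because none of its samples fell into the test set. By extracting feature importance from the SVM model, we can see which features influence the classification result most.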

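1.4 Logistic Regression Classification Analysis

We also classified the data with a logistic regression model and evaluated its performance.

Model training details:

Data split: the dataset was divided into a training set and a test set at a ratio of 70%:30%.
Standardization: the data was standardized with StandardScaler.
Model parameters: default parameters were used, with the maximum number of iterations set to 1000 (max_iter=1000).

The evaluation results are shown below.

Table 2. Classification metrics for the logistic regression model

Cluster	precision	recall	f1-score	support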
0	0.75	0.67	0.71	9
1	0.95	0.97	0.96	59
3	1.00	1.00	1.00	2
4	1.00	1.00	1.00	2

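The logistic regression model also performs well, with an accuracy of 93%. It falls short of the SVM model, mainly on cluster 0 (recall 0.67), but it still classifies the data effectively.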
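The 93% figure can be checked against Table 2: overall accuracy is the number of correctly classified samples divided by the total support, and recall times support gives the correct count per class. The short calculation below uses only the values printed in the table; rounding to whole samples is an assumption that undoes the two-decimal rounding of the recall column.

```python
# Per-class recall and support copied from Table 2.
recall = {0: 0.67, 1: 0.97, 3: 1.00, 4: 1.00}
support = {0: 9, 1: 59, 3: 2, 4: 2}

# recall * support approximates the number of correctly classified samples
# in each class; round() recovers the whole-sample count.
correct = {c: round(recall[c] * support[c]) for c in recall}
total = sum(support.values())
accuracy = sum(correct.values()) / total

print(correct)
print(f"{accuracy:.2%}")
```

This gives 67 correct predictions out of 72 test samples, which is consistent with the reported 93%.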

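1.5 Random Forest Classification Analysis

We also classified the data with a random forest model and evaluated its performance.

The evaluation results are shown below.

Table 3. Classification metrics for the random forest model

Cluster	precision	recall	f1-score	support
0	1.00	1.00	1.00	9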
1	1.00	1.00	1.00	59
3	1.00	1.00	1.00	2
4	1.00	1.00	1.00	2

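The random forest model performs equally well, reaching 100% accuracy.

1.6 Feature Importance Charts

The feature importance charts show which features each model treats as most important:

Figure 2. Feature importance by model

The figure shows feature importance for the SVM, logistic regression, and random forest models. The importance rankings differ from model to model.

SVM model: focuses mainly on price-related features (such as 原价最低, the lowest original price, and 原价最高, the highest original price).
Logistic regression model: emphasizes sales-volume and price-related features (such as 总销量, total sales volume, and 原价最低).
Random forest model: draws on price, sales-volume, and review-related features together (such as 原价最低, 销售价最高 (highest sale price), 总销售额 (total sales revenue), and 总销量).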
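One caveat when comparing these rankings: for a multi-class problem, `coef_` of a linear SVM or a logistic regression is a matrix with one row per pairwise or per-class classifier, so `coef_[0]` (as used in the code at the end of this article) reflects only the first row. The sketch below shows the shapes on synthetic data and one possible way to summarize all rows; averaging absolute coefficients is an illustrative choice, not the method used to produce Figure 2.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: 4 classes, 5 features.
X, y = make_blobs(n_samples=200, centers=4, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)

svm = SVC(kernel="linear").fit(X, y)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

# A linear SVC is one-vs-one: 4 classes give 6 rows.
# Multi-class logistic regression has one row per class: 4 rows.
print(svm.coef_.shape, logreg.coef_.shape)

# One possible summary: mean absolute coefficient per feature across all rows.
svm_importance = np.abs(svm.coef_).mean(axis=0)
logreg_importance = np.abs(logreg.coef_).mean(axis=0)
print(svm_importance.round(3))
print(logreg_importance.round(3))
```

Coefficient magnitudes are only comparable across features when the inputs are on the same scale, which is why the sketch standardizes `X` first.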

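From the analysis above, we can draw the following conclusions:

The data separates naturally into several groups, each of which may correspond to consumers or products with similar characteristics.
The classification models rank feature importance differently, which shows that each model weighs the features in its own way.
The SVM and random forest models perform very well and classify the data accurately; the logistic regression model also does well but is slightly behind the other two.

These results support data-driven decisions. For example, marketing strategies can be tailored to the characteristics of each group, and the features identified as important can guide product design and service improvements.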

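2. Presentation of Results

This report analyzes the data for the WIS whitening sunscreen spray (WIS美白防晒霜喷). It uses several chart types to present key indicators such as market size, industry trend, transaction index trend, and best-selling price bands, in order to reveal latent patterns in the data.

2.1 Industry Size Chart

The industry size chart shows industry size in terms of 30-day sales revenue and total sales revenue.

The chart shows that total sales revenue is far higher than 30-day sales revenue: the industry has accumulated a large volume of sales over the long term, while sales in the most recent 30 days are comparatively low. Part of this gap is expected, since total revenue is cumulative; the rest may be due to seasonal factors or market fluctuations.

2.2 Best-Selling Price Band Chart

The best-selling price band chart shows how the number of products is distributed across price ranges.

The chart shows that most best-selling products fall within a few specific price ranges. Knowing these ranges helps optimize pricing strategy in order to maximize sales.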
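Counting products per price band is easier to read when prices are first grouped into ranges rather than counted value by value. A minimal sketch with `pandas.cut` follows; the prices, the band edges, and the labels are all hypothetical.

```python
import pandas as pd

# Hypothetical lowest sale prices (in yuan) for a handful of listings.
prices = pd.Series([39.9, 49.0, 59.0, 59.9, 69.0, 79.0, 89.0, 99.0, 129.0, 199.0])

# Hypothetical band edges; right=False makes each band [low, high).
edges = [0, 50, 100, 150, 200]
labels = ["0-50", "50-100", "100-150", "150-200"]
bands = pd.cut(prices, bins=edges, labels=labels, right=False)

# Number of products per band, in band order.
band_counts = bands.value_counts().sort_index()
print(band_counts)
```

In this toy example the 50-100 band holds 6 of the 10 listings, which is the kind of concentration the chart is meant to expose.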

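2.3 Data Visualization Charts: Pie Chart

The pie chart shows the distribution of 30-day sales revenue and total sales revenue.

The bar chart shows the absolute values of 30-day sales revenue and total sales revenue, and makes clear that total sales revenue is much higher. The pie chart shows the two as proportions of their combined value, which reinforces the same point.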
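If total sales revenue already includes the last 30 days (the source does not say, so this is an assumption), a single ratio may describe the relationship more directly than a two-slice pie. The sketch below computes that share on made-up numbers; only the column names are taken from the article's code.

```python
import pandas as pd

# Hypothetical per-product figures; the column names mirror those in the article's code.
df_toy = pd.DataFrame({
    "30天销售额": [1200.0, 800.0, 500.0],
    "总销售额": [24000.0, 10000.0, 6000.0],
})

# Sum each column over all products, then take the ratio.
totals = df_toy[["30天销售额", "总销售额"]].sum()
recent_share = totals["30天销售额"] / totals["总销售额"]
print(f"Share of total sales made in the last 30 days: {recent_share:.1%}")
```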

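2.4 Word Clouds

Word clouds were generated for three text sources: product reviews, customer questions, and answers to those questions. They are shown in the figures below:

Figure: Review word cloud

Figure: Question word cloud

Figure: Answer word cloud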
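A word cloud is a visual rendering of term frequencies, so a plain frequency table is a useful companion when exact counts matter. The sketch below counts tokens after removing stopwords using only the standard library. The helper `top_terms`, the token list, and the stopword set are hypothetical; in practice the tokens would come from a segmenter such as `jieba.cut`.

```python
from collections import Counter

def top_terms(tokens, stopwords, n=5):
    """Return the n most frequent tokens after dropping stopwords and blanks."""
    kept = [t for t in tokens if t.strip() and t not in stopwords]
    return Counter(kept).most_common(n)

# Hypothetical pre-segmented tokens standing in for segmenter output.
tokens = ["防晒", "效果", "好", "的", "防晒", "清爽", "的", "效果", "防晒", " "]
stopwords = {"的", "好"}

print(top_terms(tokens, stopwords, n=2))
```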

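2.5 Conclusions

From the charts above, we can draw the following conclusions:

1. The industry's long-term sales performance (total sales revenue) is significantly higher than its recent sales performance (30-day sales revenue).
2. 30-day sales revenue and 30-day sales volume are closely and positively correlated.
3. Market activity appears to show clear seasonal fluctuation or the influence of promotions in specific periods.
4. Best-selling products are concentrated in specific price ranges, so optimizing pricing strategy can help increase sales.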
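The correlation stated in conclusion 2 can be quantified with a Pearson coefficient. The sketch below shows the calculation on synthetic data in which revenue is volume times a price drawn from a narrow range; the data and the price range are assumptions, and only the column names follow the article's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: revenue = units * price, with prices in a narrow range.
units = rng.integers(100, 5000, size=50)
price = rng.uniform(60, 90, size=50)
toy = pd.DataFrame({"30天销量": units, "30天销售额": units * price})

# Pearson correlation between 30-day sales volume and 30-day sales revenue.
pearson_r = toy["30天销量"].corr(toy["30天销售额"])
print(f"Pearson r = {pearson_r:.3f}")
```

On the real dataset the same call would be `df['30天销量'].corr(df['30天销售额'])`, assuming both columns are numeric.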

Code:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns


# Configure a font that can render Chinese text in plots
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
#%%
# Read the Excel file
file_path = '副本WIS美白防晒霜喷.xlsx'
df = pd.read_excel(file_path)
#%%
# Standardize the numeric columns
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df.select_dtypes(include=['float64', 'int64']))

# K-means clustering
kmeans = KMeans(n_clusters=5, random_state=0)
df['Cluster'] = kmeans.fit_predict(scaled_data)

# Visualize the clusters (only the first two standardized features are plotted)
plt.figure(figsize=(10, 6))
sns.scatterplot(x=scaled_data[:, 0], y=scaled_data[:, 1], hue=df['Cluster'], palette='viridis')
plt.title('Kmeans Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(title='Cluster')
plt.savefig('聚类图.png')
plt.show()

#%%
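# Select the features and the target variable
# Drop 'Cluster' explicitly so the label can never end up among the features
X = df.select_dtypes(include=['float64', 'int64']).drop(columns=['Cluster'], errors='ignore')
y = df['Cluster']
# Split into training and test sets (70%/30%)
# Note: X holds the raw, unscaled feature values at this point
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# SVM classifier
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Evaluate the SVM model
y_pred_svm = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, y_pred_svm)
print(f"SVM Accuracy: {svm_accuracy}")
print(confusion_matrix(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))

# Extract SVM feature importance
# Note: coef_[0] is only the first row of the multi-class coefficient matrix
svm_feature_importance = np.abs(svm_model.coef_[0])
feature_names = X.columns
svm_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': svm_feature_importance}).sort_values(by='Importance', ascending=False)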
#%%

# Logistic regression classifier
logreg_model = LogisticRegression(max_iter=1000)
logreg_model.fit(X_train, y_train)

# Evaluate the logistic regression model
y_pred_logreg = logreg_model.predict(X_test)
logreg_accuracy = accuracy_score(y_test, y_pred_logreg)
print(f"Logistic Regression Accuracy: {logreg_accuracy}")
print(confusion_matrix(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))

# Extract logistic regression feature importance
# Note: coef_[0] holds the coefficients of the first class only
logreg_feature_importance = np.abs(logreg_model.coef_[0])
logreg_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': logreg_feature_importance}).sort_values(by='Importance', ascending=False)

# Random forest classifier
rf_model = RandomForestClassifier(random_state=0)
rf_model.fit(X_train, y_train)

# Evaluate the random forest model
y_pred_rf = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {rf_accuracy}")
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

# Extract random forest feature importance
rf_feature_importance = rf_model.feature_importances_
rf_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': rf_feature_importance}).sort_values(by='Importance', ascending=False)

# Plot the feature importance charts
plt.figure(figsize=(18, 8))

# SVM feature importance
plt.subplot(1, 3, 1)
plt.barh(svm_importance_df['Feature'], svm_importance_df['Importance'])
plt.title('SVM Feature Importance')

# Logistic regression feature importance
plt.subplot(1, 3, 2)
plt.barh(logreg_importance_df['Feature'], logreg_importance_df['Importance'])
plt.title('Logistic Regression Feature Importance')

# Random forest feature importance
plt.subplot(1, 3, 3)
plt.barh(rf_importance_df['Feature'], rf_importance_df['Importance'])
plt.title('Random Forest Feature Importance')

plt.tight_layout()
plt.savefig('特征重要行.png')
plt.show()
#%%
# Industry size chart
industry_size_data = df[['30天销售额', '总销售额']].sum()
industry_size_data.plot(kind='bar', title='Industry Size')
plt.xlabel('Category')
plt.ylabel('Sales')
plt.savefig('行业规模图示例.png')
plt.show()
#%%
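# Industry trend chart
# Note: rows are averaged in blocks of 10 by row index, so the x-axis is a block index rather than a calendar date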
trend_data = df[['30天销售额', '30天销量']].groupby(df.index // 10).mean()
trend_data.plot(kind='line', title='Industry Trend')
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()
#%%
# Transaction index trend chart
transaction_index_data = df[['30天销售额']].groupby(df.index // 10).mean()
transaction_index_data.plot(kind='line', title='Transaction Index Trend')
plt.xlabel('Time')
plt.ylabel('Transaction Index')
plt.show()
#%%
# Growth comparison chart: row-to-row percentage change of 30-day sales revenue
growth_data = df[['30天销售额']].pct_change().dropna()
growth_data.plot(kind='bar', title='Transaction Amount Growth Comparison')
plt.xlabel('Time')
plt.ylabel('Growth Rate')
plt.show()
#%%
# Best-selling price band chart
hot_selling_price_data = df['销售价最低'].value_counts().sort_index()
hot_selling_price_data.plot(kind='bar', title='Hot Selling Price Bands')
plt.xlabel('Price Range')
plt.ylabel('Number of Products')
# Save after both axis labels are set so the y-axis label appears in the file
plt.savefig('热卖价格波段图示例.png',dpi=200)
plt.show()
#%%
# Data visualization (bar chart and pie chart)
# Uses the '30天销售额' and '总销售额' columns

# Bar chart
sales_data = df[['30天销售额', '总销售额']].sum()
plt.figure(figsize=(10, 6))
sales_data.plot(kind='bar', title='Sales Data')
plt.xlabel('Category')
plt.ylabel('Sales')
plt.show()

# Pie chart
plt.figure(figsize=(10, 6))
sales_data.plot(kind='pie', title='Sales Distribution', autopct='%1.1f%%')
plt.ylabel('')
plt.savefig('饼图.png',dpi=200)

plt.show()
#%%
import pandas as pd

# Load the Excel file
file_path = '副本防晒霜热词.xlsx'
data = pd.read_excel(file_path)

# Display the first few rows of the dataframe to understand its structure
data.head()
#%%
import jieba

# Read the customer questions text file
with open('./data/questions.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Segment the Chinese text into words
segmented_text = ' '.join(jieba.cut(text))

#%%
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Configure a font that can render Chinese text in plots
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus']=False

# Generate the word cloud
wordcloud = WordCloud(font_path='SIMYOU.TTF',  # replace with the path to a Chinese font file
                      width=800,
                      height=400,
                      background_color='white').generate(segmented_text)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.savefig('词云-提问.png')
plt.show()
#%%
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Configure a font that can render Chinese text in plots
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Load the stopword list (one word per line)
def load_stopwords(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        stopwords = set(file.read().splitlines())
    return stopwords

# Segment the text and remove stopwords
def tokenize_and_remove_stopwords(text, stopwords):
    tokens = jieba.cut(text)
    filtered_tokens = [token for token in tokens if token not in stopwords and token.strip()]
    return ' '.join(filtered_tokens)

# Generate a word cloud
def generate_wordcloud(text, stopwords, font_path, output_file):
    # Segment the Chinese text and remove stopwords
    segmented_text = tokenize_and_remove_stopwords(text, stopwords)

    # Build the word cloud
    wordcloud = WordCloud(font_path=font_path,
                          width=800,
                          height=400,
                          background_color='white').generate(segmented_text)

    # Display the word cloud and save it
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.savefig(output_file)
    plt.show()

# Load the stopwords
stopwords = load_stopwords('data/停用词.txt')

# Process questions.txt
with open('./data/questions.txt', 'r', encoding='utf-8') as file:
    questions_text = file.read()
generate_wordcloud(questions_text, stopwords, 'SIMYOU.TTF', '词云-提问.png')

# Process answers.txt
with open('./data/answers.txt', 'r', encoding='utf-8') as file:
    answers_text = file.read()
generate_wordcloud(answers_text, stopwords, 'SIMYOU.TTF', '词云-回答.png')