Using Kimi to Help Me Finish a Big Data Project! (Competition Website User Behavior Analysis)

Latest recommended article published on 2025-02-27 11:25:19

江河之流

Column: 开源节流计划 | Tags: big data

Article link: https://blog.csdn.net/2303_77434440/article/details/143174067

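Why use AI? Because people who don't use AI will be beaten by people who do!

The Internet's user base is now too large to ignore and its market potential is huge. Website operators are actively analyzing user behavior so that they can offer differentiated services to different customer groups and achieve precision marketing. This case study uses users' historical browsing records to study their interest preferences, analyze their needs, and find their points of interest, and then divides users into groups. The company can later offer differentiated services to each group and improve the user experience.

By working through this case, you can learn how to identify users, analyze user behavior, and construct clustering indicators, practice the main methods and skills of user clustering, and lay a foundation for later courses and for future work in data analysis.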

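The case is divided into the following four steps:

Case background and goal

The user base keeps growing, so operators are taking active measures: analyzing user characteristics and offering differentiated services to different customer groups (the attention economy, personalized service, "precision marketing"). Grouping users only by registration data (gender, age, region, occupation) leaves out behavioral characteristics and interest preferences, which makes it hard to support precision-marketing decisions. The technical solution is data mining: analyze users' historical browsing behavior and build an automatic user clustering model from the features of that behavior.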

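Goal:

As the number of users grows, it becomes harder and harder for the company to understand what users need. The ultimate goal is to meet those needs through personalized service.

Analysis flow:

Acquire the data, preprocess the data, identify users and analyze their behavior, build the user clustering model, and analyze the characteristics of each user group.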

Through a row-and-column matrix, each user's data becomes a point that can be described.
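As a rough illustration of this idea, the sketch below turns a tiny click log into a user-by-page matrix, where each row is one user and can be read as a point. The data values are invented, and `page_label` is assumed here to be a page category derived from the path; only the `realId` name follows the code later in this post.

```python
import pandas as pd

# Hypothetical click log: one row per page view (values invented for illustration).
clicks = pd.DataFrame({
    "realId": [1, 1, 2, 2, 2, 3],
    "page_label": ["notice", "news", "news", "news", "course", "notice"],
})

# Rows are users, columns are page categories, cells are click counts.
# Each row can then be treated as one point in a clustering algorithm.
user_matrix = pd.crosstab(clicks["realId"], clicks["page_label"])
print(user_matrix)
```

Each row of `user_matrix` is a vector of click counts, which is the kind of input a clustering model such as K-Means expects.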

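Data preprocessing

Model construction

Analysis results and summary

Sure, below is an organized version of the Jupyter notebook content you provided, to make it easier to memorize and read: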

# Attribute Reduction and Data Transformation

## Importing Libraries

```python
import re
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
```

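## Attribute Reduction

```python
# Read the raw log (GBK-encoded) and keep only the columns used in the analysis.
original_data = pd.read_csv('./user_cluster.csv', encoding='gbk')
var = ["page_path", "username", "userid", "ip", "sessionid", "date_time", "uniqueVisitorId"]
# .copy() makes later in-place edits apply to tipdm_data itself rather than to a view.
tipdm_data = original_data[var].copy()
```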

## Data Preview

```python
tipdm_data
```

## Handling IP and Session ID

```python
# Sessions that appear with more than one distinct IP.
ip_sessionid = tipdm_data[['ip', 'sessionid']].drop_duplicates()
sessionid_count = pd.DataFrame(ip_sessionid.groupby(['sessionid'])['ip'].count())
sessionid_count['sessionid'] = sessionid_count.index.tolist()

# Take the session IDs themselves (column 0 holds the IP counts, not the IDs).
rept_sessionid = sessionid_count[sessionid_count.ip > 1]['sessionid'].tolist()
for sid in rept_sessionid:
    rept_num = tipdm_data[tipdm_data['sessionid'] == sid].index.tolist()
    # Label-based assignment: give every row of the session the IP of its first row.
    tipdm_data.loc[rept_num, 'ip'] = tipdm_data.loc[rept_num[0], 'ip']
```
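The same "one IP per session" idea can also be written without an explicit loop. The following is only a sketch on invented data, not the notebook's own method: it uses `groupby(...).transform("first")`, which takes the first non-null IP in each session, so it differs slightly from the loop when the first row of a session has a missing IP.

```python
import pandas as pd

# Hypothetical log rows; session s1 was seen with two different IPs.
logs = pd.DataFrame({
    "sessionid": ["s1", "s1", "s2", "s2", "s3"],
    "ip": ["10.0.0.1", "10.0.0.9", "10.0.0.2", "10.0.0.2", "10.0.0.3"],
})

# Replace every IP with the first IP observed in the same session.
logs["ip"] = logs.groupby("sessionid")["ip"].transform("first")
print(logs)
```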

## Handling User ID

```python
# Sessions that appear with more than one distinct (non-null) user ID.
userid_sessionid = tipdm_data[["userid", "sessionid"]].drop_duplicates().reset_index(drop=True)

sessionid_count_1 = pd.DataFrame(userid_sessionid.groupby(["sessionid"])["userid"].count())
sessionid_count_1["sessionid"] = sessionid_count_1.index.tolist()
sessionid_count_1.columns = ['count', 'sessionid']

# Select the session IDs, not the counts in column 0.
rept_sessionid_1 = sessionid_count_1[sessionid_count_1["count"] > 1]["sessionid"].tolist()

for sid in rept_sessionid_1:
    rept_num_1 = tipdm_data[tipdm_data["sessionid"] == sid].index.tolist()
    rept_data = tipdm_data.loc[rept_num_1, "userid"]
    # Use the first non-null user ID of the session for all of its rows.
    tipdm_data.loc[rept_num_1, "userid"] = rept_data[rept_data.notnull()].iloc[0]
```
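A loop-free variant of this step might look like the sketch below, again on invented data. Note the assumption: it fills every session with its first non-null `userid`, including sessions that have one known ID plus some missing values, whereas the loop above only touches sessions with more than one distinct ID. Sessions with no known ID stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: s1 has a user ID on only one row, s2 has none at all.
logs = pd.DataFrame({
    "sessionid": ["s1", "s1", "s1", "s2", "s3", "s3"],
    "userid": [np.nan, "u7", np.nan, np.nan, "u9", "u9"],
})

# "first" skips missing values, so each session gets its first known user ID.
logs["userid"] = logs.groupby("sessionid")["userid"].transform("first")
print(logs)
```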

# User Identification and Behavior Analysis

## User Identification

```python
# Split the records by whether userid is missing.
na_index = tipdm_data[tipdm_data['userid'].isnull()].index.tolist()
na_userid = tipdm_data.loc[na_index].reset_index(drop=True)
nona_userid = tipdm_data.drop(index=na_index).reset_index(drop=True)
```

## Shape of the User Data

```python
print(na_userid.shape)
print(nona_userid.shape)
```

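## Processing User IDs

```python
# Among records without a userid, split by whether uniqueVisitorId is missing.
na_index_1 = na_userid[na_userid['uniqueVisitorId'].isnull()].index.tolist()
na_uniqueVisitorld = na_userid.loc[na_index_1].copy()
nona_uniqueVisitorld = na_userid.drop(index=na_index_1).copy()

# Fall back to the IP when both IDs are missing, otherwise to uniqueVisitorId.
na_uniqueVisitorld['userid'] = na_uniqueVisitorld['ip']
nona_uniqueVisitorld['userid'] = nona_uniqueVisitorld['uniqueVisitorId']

# ignore_index=True gives the combined table a unique 0..n-1 index.
data = pd.concat([nona_userid, na_uniqueVisitorld, nona_uniqueVisitorld], axis=0, ignore_index=True)
data['userid'] = data['userid'].apply(lambda x: str(x))
# Equal userid strings receive the same rank, so realId identifies a user.
data['realId'] = data['userid'].rank()

total_user = len(data['realId'].drop_duplicates())
```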
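The identification rule here is a fallback chain: use `userid` if present, otherwise `uniqueVisitorId`, otherwise `ip`. The sketch below expresses that chain with `fillna` on invented data. It swaps in `pd.factorize` instead of `rank()` to number the users, which is an alternative chosen for this example, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical visit records with gaps in userid and uniqueVisitorId.
visits = pd.DataFrame({
    "userid": ["u1", np.nan, np.nan, "u1", np.nan],
    "uniqueVisitorId": ["v1", "v2", np.nan, "v1", "v2"],
    "ip": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "1.1.1.1", "2.2.2.2"],
})

# Fallback chain: userid -> uniqueVisitorId -> ip.
identity = (
    visits["userid"]
    .fillna(visits["uniqueVisitorId"])
    .fillna(visits["ip"])
    .astype(str)
)

# factorize assigns one integer code per distinct identity string.
visits["realId"] = pd.factorize(identity)[0]
total_user = visits["realId"].nunique()
print(visits[["realId"]], total_user)
```

Integer codes from `factorize` are easier to read than fractional ranks, but either works as long as equal identities map to equal values.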

## Behavior Analysis

```python
reallid_sessionid = data[['realId', 'sessionid']].drop_duplicates()

reallid_count = pd.DataFrame(reallid_sessionid.groupby('realId')['realId'].count())
reallid_count.columns = ['count']
reallid_count['realId'] = reallid_count.index.tolist()
```

## Extracting Users Who Logged In Only Once

```python
# Users whose records contain exactly one session.
click_one_user = reallid_count['realId'][reallid_count['count'] == 1].tolist()

# All rows belonging to those users.
click_one_data = data[data['realId'].isin(click_one_user)]

# Number of clicks (rows) for each of those users.
reallid_count_1 = pd.DataFrame(click_one_data.groupby('realId')['realId'].count())
reallid_count_1.columns = ['count']
reallid_count_1['realId'] = reallid_count_1.index.tolist()
```

## Extracting Users Who Logged In Once and Clicked Only One Page

```python
# Users with a single session and a single click.
one_click_user = reallid_count_1['realId'][reallid_count_1['count'] == 1].tolist()

user = data['realId'].drop_duplicates()

# Users to keep: everyone except the one-session, one-click users.
user1 = user[~user.isin(one_click_user)].tolist()

# Keep only the rows of the remaining users.
new_data = data[data['realId'].isin(user1)]
```
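The last two steps together amount to "drop users who have exactly one session and exactly one click". A compact way to express that, shown here as a sketch on invented data with a hypothetical frame named `demo`, is to compute both statistics per user in one `groupby` and filter with `isin`.

```python
import pandas as pd

# Hypothetical records: users 1 and 4 have one session with a single click.
demo = pd.DataFrame({
    "realId":    [1, 2, 2, 3, 3, 4],
    "sessionid": ["a", "b", "b", "c", "d", "e"],
    "page_path": ["/x.jhtml", "/x.jhtml", "/y.jhtml", "/x.jhtml", "/z.jhtml", "/y.jhtml"],
})

# Per-user number of distinct sessions and number of clicks (rows).
stats = demo.groupby("realId").agg(
    sessions=("sessionid", "nunique"),
    clicks=("page_path", "size"),
)

# Users with one session and one click carry little behavioral information.
one_click_user = stats[(stats["sessions"] == 1) & (stats["clicks"] == 1)].index
new_demo = demo[~demo["realId"].isin(one_click_user)]
print(new_demo)
```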

# URL Classification and Data Cleaning

## Adding a Row Number

```python
# Reset the index first, then number the rows.
new_data = new_data.reset_index(drop=True)
new_data['number_id'] = list(range(new_data.shape[0]))

# Pages whose path ends with .jhtml are saved for the following steps.
jhtml_index = new_data[new_data['page_path'].apply(lambda x: str(x).endswith('.jhtml'))].index.tolist()
jhtml_data = new_data.iloc[jhtml_index]
jhtml_data.to_csv('./jhtml_data.csv', header=True, index=True)

unjhtml_data = new_data.drop(index=jhtml_index)

# Non-.jhtml paths that were requested more than once.
unjhtml_count = pd.DataFrame(unjhtml_data.groupby('page_path')['page_path'].count())
unjhtml_count.columns = ['count']
unjhtml_count['page_path'] = unjhtml_count.index.tolist()
unjhtml_count = unjhtml_count[unjhtml_count.iloc[:, 0] > 1]
```

## Counting Each User's Historical Clicks

```python
jhtml_data = pd.read_csv('./jhtml_data.csv')
total_click = pd.DataFrame(jhtml_data.groupby('realId')['realId'].count())
total_click.columns = ['count']
total_click['realId'] = total_click.index.tolist()

total_click = total_click.sort_values(by='count', ascending=True)

more40_user = total_click[total_click.iloc[:, 0] > 40]
```

## Removing Users With More Than 40 Clicks

```python
# Select by realId; column 0 of more40_user holds the click counts, not the IDs.
heavy_users = more40_user['realId'].tolist()
num = jhtml_data[jhtml_data['realId'].isin(heavy_users)].index.tolist()
jhtml_data1 = jhtml_data.drop(index=num)
```
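Counting clicks and dropping heavy users can also be done in one pass. The sketch below is an illustration on invented data with a hypothetical `threshold` of 2 standing in for the 40 used above; it attaches each user's click count to every row with `transform("size")` and filters on it.

```python
import pandas as pd

# Hypothetical page views: user 1 clicked three times.
pages = pd.DataFrame({"realId": [1, 1, 1, 2, 3, 3]})
threshold = 2  # stands in for the 40 used in this post

# Each row receives the total click count of its user.
clicks_per_user = pages.groupby("realId")["realId"].transform("size")

# Keep only rows of users at or below the threshold.
kept = pages[clicks_per_user <= threshold]
print(kept)
```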

# Page Classification and Attribute Construction

## Extracting page_path and realId

```python
mode_data = jhtml_data[['page_path', 'realId']]

# Drop paths that contain '%'.
det_num = jhtml_data[jhtml_data['page_path'].apply(lambda x: re.search(r'%', str(x)) is not None)].index.tolist()
mode_data = mode_data.drop(index=det_num).reset_index(drop=True)

# Strip the suffix, digits, "index", "_" and "/" so that only the category label remains.
mode_data["page_label"] = (
    mode_data["page_path"]
    .str.replace(".jhtml", "", regex=False)
    .str.replace(r"\d", "", regex=True)
    .str.replace("index", "", regex=False)
    .str.replace("_", "", regex=False)
    .str.replace("/", "", regex=False)
)

# Drop rows whose label is empty after stripping.
space_num = mode_data[mode_data['page_label'] == ''].index.tolist()
mode_data = mode_data.drop(index=space_num)

# Count the clicks per label and keep the 26 most frequent labels.
url_concet = pd.DataFrame(mode_data.groupby('page_label')['page_label'].count())
url_concet.columns = ['count']
url_concet = url_concet.sort_values('count', ascending=False)

url = url_concet.index.tolist()[0: 26]
```
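To see what the label extraction does to a single path, here is a small standalone sketch using only the standard library. The sample paths are invented, and `page_label` is a hypothetical helper; a single regular expression is used in place of the chained replacements, which gives the same result for ordinary paths like these.

```python
import re
from collections import Counter

def page_label(path: str) -> str:
    """Remove the .jhtml suffix, digits, 'index', '_' and '/' from a page path."""
    return re.sub(r"\.jhtml|\d|index|_|/", "", path)

# Hypothetical paths; the real ones come from the page_path column.
paths = ["/notice/index_2.jhtml", "/ts/1024.jhtml", "/index.jhtml", "/notice/77.jhtml"]
labels = [page_label(p) for p in paths]
print(labels)

# Count non-empty labels, most frequent first.
top_labels = Counter(label for label in labels if label).most_common(26)
print(top_labels)
```

A path such as `/index.jhtml` reduces to an empty string, which is why empty labels are dropped before counting.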

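Don't avoid using AI: once you start avoiding it, all kinds of problems appear, and people who don't use AI will be beaten by people who do! (I recommend a memorization tool: 记忆助手!)

Now let's solve a problem:

Problem description: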

C:\Users\gigi\AppData\Roaming\Python\Python38\site-packages\IPython\core\interactiveshell.py:3135: DtypeWarning: Columns (2,3) have mixed types.Specify dtype option on import or set low_memory=False. has_raised = await self.run_ast_nodes(code_ast.body, cell_name,

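When you read a CSV file with pandas' `read_csv` function, pandas may emit a `DtypeWarning` if some columns contain mixed data types (for example, both numbers and strings). This happens because pandas finds inconsistencies while trying to infer each column's data type. You have several options for resolving it: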

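1. Set `low_memory=False`:

This makes pandas parse the whole file in one pass instead of in chunks, so each column's type is inferred from all of its values at once. It reduces or eliminates the `DtypeWarning`, but it may increase memory use, especially for large files. For example:

```python
df = pd.read_csv('your_file.csv', low_memory=False)
```

This approach is simple, but a very large file may consume a lot of memory or even cause an out-of-memory error.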

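2. Specify column data types explicitly (recommended):

Passing the `dtype` parameter gives columns an explicit data type, so pandas does not have to infer it and the mixed-type problem is avoided. For example, if you know a column should be a string, you can write:

```python
df = pd.read_csv('your_file.csv', dtype={'column_name': str})
```

This approach is more robust, especially for large datasets: pandas no longer has to guess the type of those columns, so the warning goes away without the extra memory that `low_memory=False` can require.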

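3. Use converters to handle mixed-type data:

If the values in some columns have inconsistent types, you can use the `converters` parameter to supply a conversion function for those columns and unify the data into the type you expect. For example:

```python
def convert_to_float(value):
    # Values that cannot be parsed as a number become 0.0.
    try:
        return float(value)
    except ValueError:
        return 0.0

df = pd.read_csv('your_file.csv', converters={'column_name': convert_to_float})
```

This approach is more flexible, because you define custom conversion logic that is applied to every value in the column.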

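4. Ignore the warning:

If you are sure the mixed types will not affect your analysis, or you already know how you will handle the data, you can choose to ignore the warning. For example:

```python
import warnings
import pandas as pd

warnings.filterwarnings('ignore', category=pd.errors.DtypeWarning)
```

This does not fix the underlying problem, but it cuts down on the noise from warning messages.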
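If you do choose to silence the warning, it may be safer to limit the suppression to the one `read_csv` call rather than the whole session. The sketch below shows that with `warnings.catch_warnings`; the CSV text and column names are invented so the example runs on its own.

```python
import io
import warnings

import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = "userid,ip\n1001,10.0.0.1\nguest,10.0.0.2\n"

# The filter applies only inside the with-block and is restored afterwards.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=pd.errors.DtypeWarning)
    df = pd.read_csv(io.StringIO(csv_text))

print(df)
```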

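Which method to choose depends on your needs: the size of the dataset, how consistent the data in each column is, and how precise your analysis has to be. For large datasets, explicitly specifying data types is recommended to avoid the risk of running out of memory. For smaller datasets, or when you need to process data quickly without precise type control, setting `low_memory=False` may be the simpler solution.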
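As a small runnable illustration of the `dtype` and `converters` options, the sketch below reads an in-memory CSV in which `userid` mixes numbers and text and `score` contains an unparseable value. The column names, the data, and the helper `to_float_or_nan` are assumptions made for this example; which columns actually triggered the warning in `user_cluster.csv` is not known from the message alone.

```python
import io
import math

import pandas as pd

# Hypothetical CSV: userid mixes numeric and text values, score has a bad entry.
csv_text = (
    "userid,ip,score\n"
    "1001,10.0.0.1,1.5\n"
    "guest,10.0.0.2,n/a\n"
    "1003,10.0.0.3,3\n"
)

def to_float_or_nan(value):
    """Parse a float; return NaN so that bad values stay distinguishable from 0."""
    try:
        return float(value)
    except ValueError:
        return math.nan

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"userid": str, "ip": str},        # read these columns as text
    converters={"score": to_float_or_nan},   # custom parsing for this column
)
print(df)
```

Returning NaN instead of `0.0` is a design choice for this sketch: a later `dropna()` or `fillna()` can then decide explicitly how to treat the bad values.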