Ad Click-Through Rate Prediction and Analysis with Python

This project aims to predict ad click-through rates. Through data preprocessing, feature engineering, and building logistic regression and random forest models, it finds that factors such as age, time spent online, country, and income affect the click-through rate, and that the random forest outperforms logistic regression on the test set.

Objective: the goal of this project is to predict whether an ad will be clicked, based on the given ad and user information. If an ad has a high probability of being clicked it is shown; if the probability is low, it is not.
Analysis steps (a minimal sketch of the modeling steps (8)-(11) appears at the end of this excerpt):
(1) Understand the data
(2) Extract new features
(3) Check the distribution of the target variable
(4) Understand the relationships between the variables
(5) Identify potential outliers
(6) Build a baseline model
(7) Feature engineering
(8) Build a logistic regression model
(9) Build a random forest model
(10) Evaluate the models on the test data
(11) Identify the important features

# Load Libraries

import numpy as np     #linear algebra
import pandas as pd    #data processing
import matplotlib.pyplot as plt    #visualizations
import seaborn as sns       #visualizations
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix
import warnings    #hide warning messages
warnings.filterwarnings("ignore")
%matplotlib inline
# Load Data

df = pd.read_csv("advertising.csv") #reading the file
# examine the data

df.head(10) #checking the first 10 rows of the data
Daily Time Spent on Site Age Area Income Daily Internet Usage Ad Topic Line City Male Country Timestamp Clicked on Ad
0 68.95 35 61833.90 256.09 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia 2016-03-27 00:53:11 0
1 80.23 31 68441.85 193.77 Monitored national standardization West Jodi 1 Nauru 2016-04-04 01:39:02 0
2 69.47 26 59785.94 236.50 Organic bottom-line service-desk Davidton 0 San Marino 2016-03-13 20:35:42 0
3 74.15 29 54806.18 245.89 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy 2016-01-10 02:31:19 0
4 68.37 35 73889.99 225.58 Robust logistical utilization South Manuel 0 Iceland 2016-06-03 03:36:18 0
5 59.99 23 59761.56 226.74 Sharable client-driven software Jamieberg 1 Norway 2016-05-19 14:30:17 0
6 88.91 33 53852.85 208.36 Enhanced dedicated support Brandonstad 0 Myanmar 2016-01-28 20:59:32 0
7 66.00 48 24593.33 131.76 Reactive local challenge Port Jefferybury 1 Australia 2016-03-07 01:40:15 1
8 74.53 30 68862.00 221.51 Configurable coherent function West Colin 1 Grenada 2016-04-18 09:33:42 0
9 69.88 20 55642.32 183.82 Mandatory homogeneous architecture Ramirezton 1 Ghana 2016-07-11 01:42:51 0
# data type and length of the variables

df.info()  #gives the information about the data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       1000 non-null   int64  
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB
# duplicates checkup

df.duplicated().sum() #displays duplicate records
0
# numerical and categorical variables identification

df.columns #displays column names
Index(['Daily Time Spent on Site', 'Age', 'Area Income',
       'Daily Internet Usage', 'Ad Topic Line', 'City', 'Male', 'Country',
       'Timestamp', 'Clicked on Ad'],
      dtype='object')
df.select_dtypes(include=['object']).columns  #displays categorical variables which are detected by python
Index(['Ad Topic Line', 'City', 'Country', 'Timestamp'], dtype='object')
# assigning columns as numerical variables
numeric_cols = ['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage']
# Assigning columns as categorical variables
Categorical_cols = [ 'Ad Topic Line', 'City', 'Male', 'Country', 'Clicked on Ad' ]
# Summarizing Numerical Variables

df[numeric_cols].describe()
Daily Time Spent on Site Age Area Income Daily Internet Usage
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 65.000200 36.009000 55000.000080 180.000100
std 15.853615 8.785562 13414.634022 43.902339
min 32.600000 19.000000 13996.500000 104.780000
25% 51.360000 29.000000 47031.802500 138.830000
50% 68.215000 35.000000 57012.300000 183.130000
75% 78.547500 42.000000 65470.635000 218.792500
max 91.430000 61.000000 79484.800000 269.960000

Since the mean and the median (the 50th percentile) are very similar, the data are not skewed and no transformation is needed.
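As a quick, optional check of this claim (an addition, not part of the original notebook), the sample skewness can be computed directly; values near zero support the "not skewed" reading:

df[numeric_cols].skew()  # skewness close to 0 for each column indicates a roughly symmetric distribution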

# Summarizing Categorical Variables

df[Categorical_cols].describe(include = ['O'])
Ad Topic Line City Country
count 1000 1000 1000
unique 1000 969 237
top Object-based neutral policy Williamsport France
freq 1 3 9

There are many distinct cities (unique = 969) and very few users share the same city (top freq = 3), so this feature likely has little or no predictive power. The Country column is far less diverse, however, so it is worth analyzing further.

# Investigating the Country variable
pd.crosstab(df['Country'], df['Clicked on Ad']).sort_values(by=1, ascending=False).head(20) # sort countries by the number of clicks (column 1), descending
Clicked on Ad 0 1
Country
Australia 1 7
Turkey 1 7
Ethiopia 0 7
Liberia 2 6
South Africa 2 6
Liechtenstein 0 6
Senegal 3 5
Peru 3 5
Mayotte 1 5
Hungary 1 5
France 4 5
Afghanistan 3 5
Zimbabwe 2 4
Indonesia 2 4
China 2 4
Svalbard & Jan Mayen Islands 2 4
Jersey 2 4
Kenya 0 4
Antigua and Barbuda 1 4
Hong Kong 2 4
pd.crosstab(index=df['Country'], columns='count').sort_values(['count'], ascending=False).head(10)
col_0 count
Country
France 9
Czech Republic 9
Afghanistan 8
Australia 8
Turkey 8
South Africa 8
Senegal 8
Peru 8
Micronesia 8
Greece 8

Users come from all over the world; France and the Czech Republic have the most users, with 9 each.

# Check for Missing Values

df.isnull().sum()  #number of missing values in each column
Daily Time Spent on Site    0
Age                         0
Area Income                 0
Daily Internet Usage        0
Ad Topic Line               0
City                        0
Male                        0
Country                     0
Timestamp                   0
Clicked on Ad               0
dtype: int64
# extract datetime variables using timestamp column

# Converting timestamp column into datatime object in order to extract new features
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
# Creates a new column called Month
df['Month'] = df['Timestamp'].dt.month
# Creates a new column called Day
df['Day'] = df['Timestamp'].dt.day
# Creates a new column called Hour
df['Hour'] = df['Timestamp'].dt.hour
# Creates a new column called Weekday with sunday as 6 and monday as 0
df['Weekday'] = df['Timestamp'].dt.dayofweek
# Dropping timestamp column to avoid redundancy
df = df.drop(['Timestamp'], axis=1)
df.head()
Daily Time Spent on Site Age Area Income Daily Internet Usage Ad Topic Line City Male Country Clicked on Ad Month Day Hour Weekday
0 68.95 35 61833.90 256.09 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia 0 3 27 0 6
1 80.23 31 68441.85 193.77 Monitored national standardization West Jodi 1 Nauru 0 4 4 1 0
2 69.47 26 59785.94 236.50 Organic bottom-line service-desk Davidton 0 San Marino 0 3 13 20 6
3 74.15 29 54806.18 245.89 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy 0 1 10 2 6
4 68.37 35 73889.99 225.58 Robust logistical utilization South Manuel 0 Iceland 0 6 3 3 4
# visualize target variable clicked on ad

plt.figure(figsize=(14, 6))
plt.subplot(1,2,1)
sns.countplot(x='Clicked on Ad', data=df)
plt.subplot(1,2,2)
sns.distplot(df['Clicked on Ad'], bins=20)
plt.show()

[Figure: count plot and distribution of the Clicked on Ad variable]
The plots show that the number of users who clicked on the ad equals the number who did not (500 each).
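The exact counts can be confirmed with a quick tabulation (added here as a sanity check rather than taken from the original notebook):

df['Clicked on Ad'].value_counts()  # expected: 500 rows in each class, matching the count plot above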

# jointplot of daily time spent on site and age
sns.jointplot(x='Age', y='Daily Time Spent on Site', data=df)

[Figure: joint plot of Age vs. Daily Time Spent on Site]
We can see that people aged 30 to 40 spend the most time on the site each day.
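To put rough numbers on this, one illustrative step (not the author's code; the bin edges below are arbitrary) is to bin Age and compare the average daily time on site per band:

# average time on site per age band (bin edges chosen only for illustration)
age_bands = pd.cut(df['Age'], bins=[18, 30, 40, 50, 61])
df.groupby(age_bands)['Daily Time Spent on Site'].mean()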

# Distribution and Relationship Between Variables

#creating a pairplot with hue defined by clicked on ad column
sns.pairplot(df, hue='Clicked on Ad', vars=['Daily Time Spent on Site', 'Age', 'Area Income', 
                                            'Daily Internet Usage'], palette='husl')

[Figure: pair plot of the numeric features, colored by Clicked on Ad]
We can see that users who spend less time on the site, have lower incomes, and are relatively older tend to click on ads.
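This observation can be checked numerically; a small added sketch (not part of the original code) compares the feature means for clickers and non-clickers:

# mean of each numeric feature, split by whether the ad was clicked
df.groupby('Clicked on Ad')[numeric_cols].mean()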

plots = ['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage']
for i in plots:
    plt.figure(figsize=(14, 6))
    plt.subplot(1,2,1)
    sns.boxenplot(df[i])
    plt.subplot(1,2,2)
    sns.distplot(df[i], bins=20)
    plt.title(i)
    plt.show()

[Figures: boxen plots and distributions of Daily Time Spent on Site, Age, Area Income, and Daily Internet Usage]
We can clearly see that Daily Internet Usage and Daily Time Spent on Site each have two peaks, i.e. their distributions are bimodal. This indicates there are two distinct groups in our data. We would not expect these variables to be normally distributed, since some people spend more time on the internet/site and others spend less.
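One way to check that the two peaks correspond to the two groups of users (again an addition, not the author's code) is to overlay the per-class distributions:

# overlay per-class KDEs; two well-separated humps would explain the bimodal shape above
plt.figure(figsize=(14, 6))
for idx, col in enumerate(['Daily Time Spent on Site', 'Daily Internet Usage']):
    plt.subplot(1, 2, idx + 1)
    sns.kdeplot(df.loc[df['Clicked on Ad'] == 0, col], label='Not clicked')
    sns.kdeplot(df.loc[df['Clicked on Ad'] == 1, col], label='Clicked')
    plt.title(col)
    plt.legend()
plt.show()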

print('Oldest person was of: ', df['Age'].max(), 'Years')
print('Youngest person was of: ', df['Age'].min(), 'Years')
print('Average age was of: ', df['Age'].mean(), 'Years')
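The remainder of the original article (the baseline model, feature engineering, the logistic regression and random forest models, their evaluation on the test data, and the important features) is not included in this excerpt. As a minimal sketch of steps (8)-(11), assuming only the numeric and datetime-derived features prepared above are used and the high-cardinality text columns are dropped; the 70/30 split, random_state=42, and the default hyperparameters are illustrative assumptions, and the author's actual feature set, encoding, and settings may differ:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# features: keep the numeric and datetime-derived columns; drop the high-cardinality text columns
X = df.drop(['Ad Topic Line', 'City', 'Country', 'Clicked on Ad'], axis=1)
y = df['Clicked on Ad']

# hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# logistic regression model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print('Logistic Regression AUC:', roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1]))

# random forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print('Random Forest AUC:', roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))

# feature importances from the random forest (step 11)
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))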