Ad Click-Through Rate Prediction and Analysis with Python

This project aims to predict ad click-through rates. Through data preprocessing, feature engineering, and building logistic regression and random forest models, it finds that factors such as age, time spent online, country, and income affect the click-through rate, and that the random forest outperforms logistic regression on the test set.

Objective: the goal of this project is to predict whether an ad will be clicked, based on the given ad and user information. If an ad has a high probability of being clicked it is shown; if the probability is low, it is not.
Analysis steps (a minimal sketch of the modeling steps (8)-(11) appears at the end of this excerpt):
(1) Understand the data
(2) Extract new features
(3) Check the distribution of the target variable
(4) Understand the relationships between the variables
(5) Identify potential outliers
(6) Build a baseline model
(7) Feature engineering
(8) Build a logistic regression model
(9) Build a random forest model
(10) Evaluate the models on the test data
(11) Identify the important features

# Load Libraries

import numpy as np     #linear algebra
import pandas as pd    #data processing
import matplotlib.pyplot as plt    #visualizations
import seaborn as sns       #visualizations
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix
import warnings    #hide warning messages
warnings.filterwarnings("ignore")
%matplotlib inline
# Load Data

df = pd.read_csv("advertising.csv") #reading the file
# examine the data

df.head(10) #checking the first 10 rows of the data
Daily Time Spent on Site Age Area Income Daily Internet Usage Ad Topic Line City Male Country Timestamp Clicked on Ad
0 68.95 35 61833.90 256.09 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia 2016-03-27 00:53:11 0
1 80.23 31 68441.85 193.77 Monitored national standardization West Jodi 1 Nauru 2016-04-04 01:39:02 0
2 69.47 26 59785.94 236.50 Organic bottom-line service-desk Davidton 0 San Marino 2016-03-13 20:35:42 0
3 74.15 29 54806.18 245.89 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy 2016-01-10 02:31:19 0
4 68.37 35 73889.99 225.58 Robust logistical utilization South Manuel 0 Iceland 2016-06-03 03:36:18 0
5 59.99 23 59761.56 226.74 Sharable client-driven software Jamieberg 1 Norway 2016-05-19 14:30:17 0
6 88.91 33 53852.85 208.36 Enhanced dedicated support Brandonstad 0 Myanmar 2016-01-28 20:59:32 0
7 66.00 48 24593.33 131.76 Reactive local challenge Port Jefferybury 1 Australia 2016-03-07 01:40:15 1
8 74.53 30 68862.00 221.51 Configurable coherent function West Colin 1 Grenada 2016-04-18 09:33:42 0
9 69.88 20 55642.32 183.82 Mandatory homogeneous architecture Ramirezton 1 Ghana 2016-07-11 01:42:51 0
# data type and length of the variables

df.info()  #gives the information about the data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       1000 non-null   int64  
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB
# duplicates checkup

df.duplicated().sum() #displays duplicate records
0
# numerical and categorical variables identification

df.columns #displays column names
Index(['Daily Time Spent on Site', 'Age', 'Area Income',
       'Daily Internet Usage', 'Ad Topic Line', 'City', 'Male', 'Country',
       'Timestamp', 'Clicked on Ad'],
      dtype='object')
df.select_dtypes(include=['object']).columns  #displays categorical variables which are detected by python
Index(['Ad Topic Line', 'City', 'Country', 'Timestamp'], dtype='object')
# assigning columns as numerical variables
numeric_cols = ['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage']
# Assigning columns as categorical variables
Categorical_cols = [ 'Ad Topic Line', 'City', 'Male', 'Country', 'Clicked on Ad' ]
# Summarizing Numerical Variables

df[numeric_cols].describe()
Daily Time Spent on Site Age Area Income Daily Internet Usage
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 65.000200 36.009000 55000.000080 180.000100
std 15.853615 8.785562 13414.634022 43.902339
min 32.600000 19.000000 13996.500000 104.780000
25% 51.360000 29.000000 47031.802500 138.830000
50% 68.215000 35.000000 57012.300000 183.130000
75% 78.547500 42.000000 65470.635000 218.792500
max 91.430000 61.000000 79484.800000 269.960000

Since the mean and the median (the 50th percentile) are very similar, the data are not skewed and no transformation is needed.
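As a quick, optional check of this claim (an addition, not part of the original notebook), the sample skewness can be computed directly; values near zero support the "not skewed" reading:

df[numeric_cols].skew()  # skewness close to 0 for each column indicates a roughly symmetric distribution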

# Summarizing Categorical Variables

df[Categorical_cols].describe(include = ['O'])
Ad Topic Line City Country
count 1000 1000 1000
unique 1000 969 237
top Object-based neutral policy Williamsport France
freq 1 3 9

There are many distinct cities (unique = 969) and very few users share the same city (top freq = 3), so this feature likely has little or no predictive power. The Country column is far less diverse, however, so it is worth analyzing further.

# Investigating the Country variable
pd.crosstab(df['Country'], df['Clicked on Ad']).sort_values(by=1, ascending=False).head(20) # sort countries by the number of clicks (column 1), descending
Clicked on Ad 0 1
Country
Australia 1 7
Turkey 1 7
Ethiopia 0 7
Liberia 2 6
South Africa 2 6
Liechtenstein 0 6
Senegal 3 5
Peru 3 5
Mayotte 1 5
Hungary 1 5
France 4 5
Afghanistan 3 5
Zimbabwe 2 4
Indonesia 2 4
China 2 4
Svalbard & Jan Mayen Islands 2 4
Jersey 2 4
Kenya 0 4
Antigua and Barbuda 1 4
Hong Kong 2 4
pd.crosstab(index=df['Country'], columns='count').sort_values(['count'], ascending=False).head(10)
col_0 count
Country
France 9
Czech Republic 9
Afghanistan 8
Australia 8
Turkey 8
South Africa 8
Senegal 8
Peru 8
Micronesia 8
Greece 8

Users come from all over the world; France and the Czech Republic have the most users, with 9 each.

# Check for Missing Values

df.isnull().sum()  #number of missing values in each column
Daily Time Spent on Site    0
Age                         0
Area Income                 0
Daily Internet Usage        0
Ad Topic Line               0
City                        0
Male                        0
Country                     0
Timestamp                   0
Clicked on Ad               0
dtype: int64
# extract datetime variables using timestamp column

# Converting timestamp column into datatime object in order to extract new features
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
# Creates a new column called Month
df['Month'] = df['Timestamp'].dt.month
# Creates a new column called Day
df['Day'] = df['Timestamp'].dt.day
# Creates a new column called Hour
df['Hour'] = df['Timestamp'].dt.hour
# Creates a new column called Weekday with sunday as 6 and monday as 0
df['Weekday'] = df['Timestamp'].dt.dayofweek
# Dropping timestamp column to avoid redundancy
df = df.drop(['Timestamp'], axis=1)
df.head()
Daily Time Spent on Site Age Area Income Daily Internet Usage Ad Topic Line City Male Country Clicked on Ad Month Day Hour Weekday
0 68.95 35 61833.90 256.09 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia 0 3 27 0 6
1 80.23 31 68441.85 193.77 Monitored national standardization West Jodi 1 Nauru 0 4 4 1 0
2 69.47 26 59785.94 236.50 Organic bottom-line service-desk Davidton 0 San Marino 0 3 13 20 6
3 74.15 29 54806.18 245.89 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy 0 1 10 2 6
4 68.37 35 73889.99 225.58 Robust logistical utilization South Manuel 0 Iceland 0 6 3 3 4
# visualize target variable clicked on ad

plt.figure(figsize=(14, 6))
plt.subplot(1,2,1)
sns.countplot(x='Clicked on Ad', data=df)
plt.subplot(1,2,2)
sns.distplot(df['Clicked on Ad'], bins=20)
plt.show()

[Figure: count plot and distribution of the Clicked on Ad variable]
The plots show that the number of users who clicked on the ad equals the number who did not (500 each).
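The exact counts can be confirmed with a quick tabulation (added here as a sanity check rather than taken from the original notebook):

df['Clicked on Ad'].value_counts()  # expected: 500 rows in each class, matching the count plot above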

# jointplot of daily time spent on site and age
sns.jointplot(x='Age', y='Daily Time Spent on Site', data=df)

[Figure: joint plot of Age vs. Daily Time Spent on Site]
We can see that people aged 30 to 40 spend the most time on the site each day.
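To put rough numbers on this, one illustrative step (not the author's code; the bin edges below are arbitrary) is to bin Age and compare the average daily time on site per band:

# average time on site per age band (bin edges chosen only for illustration)
age_bands = pd.cut(df['Age'], bins=[18, 30, 40, 50, 61])
df.groupby(age_bands)['Daily Time Spent on Site'].mean()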

# Distribution and Relationship Between Variables

#creating a pairplot with hue defined by clicked on ad column
sns.pairplot(df, hue='Clicked on Ad', vars=['Daily Time Spent on Site', 'Age', 'Area Income', 
                                            'Daily Internet Usage'], palette='husl')

[Figure: pair plot of the numeric features, colored by Clicked on Ad]
We can see that users who spend less time on the site, have lower incomes, and are relatively older tend to click on ads.
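This observation can be checked numerically; a small added sketch (not part of the original code) compares the feature means for clickers and non-clickers:

# mean of each numeric feature, split by whether the ad was clicked
df.groupby('Clicked on Ad')[numeric_cols].mean()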

plots = ['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage']
for i in plots:
    plt.figure(figsize=(14, 6))
    plt.subplot(1,2,1)
    sns.boxenplot(df[i])
    plt.subplot(1,2,2)
    sns.distplot(df[i], bins=20)
    plt.title(i)
    plt.show()

[Figures: boxen plots and distributions of Daily Time Spent on Site, Age, Area Income, and Daily Internet Usage]
We can clearly see that Daily Internet Usage and Daily Time Spent on Site each have two peaks, i.e. their distributions are bimodal. This indicates there are two distinct groups in our data. We would not expect these variables to be normally distributed, since some people spend more time on the internet/site and others spend less.
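One way to check that the two peaks correspond to the two groups of users (again an addition, not the author's code) is to overlay the per-class distributions:

# overlay per-class KDEs; two well-separated humps would explain the bimodal shape above
plt.figure(figsize=(14, 6))
for idx, col in enumerate(['Daily Time Spent on Site', 'Daily Internet Usage']):
    plt.subplot(1, 2, idx + 1)
    sns.kdeplot(df.loc[df['Clicked on Ad'] == 0, col], label='Not clicked')
    sns.kdeplot(df.loc[df['Clicked on Ad'] == 1, col], label='Clicked')
    plt.title(col)
    plt.legend()
plt.show()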

print('Oldest person was of: ', df['Age'].max(), 'Years')
print('Youngest person was of: ', df['Age'].min(), 'Years')
print('Average age was of: ', df['Age'].mean(), 'Years')
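The remainder of the original article (the baseline model, feature engineering, the logistic regression and random forest models, their evaluation on the test data, and the important features) is not included in this excerpt. As a minimal sketch of steps (8)-(11), assuming only the numeric and datetime-derived features prepared above are used and the high-cardinality text columns are dropped; the 70/30 split, random_state=42, and the default hyperparameters are illustrative assumptions, and the author's actual feature set, encoding, and settings may differ:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# features: keep the numeric and datetime-derived columns; drop the high-cardinality text columns
X = df.drop(['Ad Topic Line', 'City', 'Country', 'Clicked on Ad'], axis=1)
y = df['Clicked on Ad']

# hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# logistic regression model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print('Logistic Regression AUC:', roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1]))

# random forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print('Random Forest AUC:', roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))

# feature importances from the random forest (step 11)
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))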