1. Background
About this dataset: in this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user’s first booking destination will be. All the users in this dataset are from the USA.
There are 12 possible outcomes for the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’. Note that ‘NDF’ is different from ‘other’: ‘other’ means there was a booking, but to a country not included in the list, while ‘NDF’ means there was no booking.
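The distinction between ‘NDF’ and ‘other’ can be made concrete with a small sketch. The names `DESTINATIONS` and `made_booking` are hypothetical helpers introduced here for illustration; they are not part of the competition data or of any library.

```python
# The 12 target classes listed in the competition description.
DESTINATIONS = ["US", "FR", "CA", "GB", "ES", "IT", "PT", "NL", "DE", "AU", "NDF", "other"]


def made_booking(label: str) -> bool:
    """Return True if the label implies a booking took place.

    'other' is a booking to a country outside the list, so it counts;
    'NDF' (no destination found) is the only class without a booking.
    """
    if label not in DESTINATIONS:
        raise ValueError(f"unknown destination label: {label!r}")
    return label != "NDF"


print(made_booking("other"), made_booking("NDF"))
```

Read this way, the task combines two questions: whether a user books at all, and if so, where.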
2. Data Description
train_users_2.csv - the training set of users
* id: user id
* date_account_created: the date of account creation
* timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
* date_first_booking: date of first booking
* gender
* age
* signup_method: how the user signed up
* signup_flow: the page a user came to sign up from
* language: international language preference
* affiliate_channel: what kind of paid marketing
* affiliate_provider: where the marketing is, e.g. google, craigslist, other
* first_affiliate_tracked: the first marketing channel the user interacted with before signing up
* signup_app: the app used to sign up
* first_device_type: type of the first device used
* first_browser: type of the first browser used
* country_destination: the booking destination country; this is the target variable you are to predict
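Since timestamp_first_active is stored as an integer such as 20090319043255, it has to be parsed before it can be compared with date_account_created. The following is a minimal sketch using two rows copied from the sample output of train.head(); the column name `days_active_before_signup` is a hypothetical feature name introduced here, not something from the original analysis.

```python
import pandas as pd

# Two sample rows taken from the training data preview.
users = pd.DataFrame({
    "id": ["gxn3p5htnn", "820tgsjxq7"],
    "date_account_created": ["2010-06-28", "2011-05-25"],
    "timestamp_first_active": [20090319043255, 20090523174809],
})

# date_account_created is a plain ISO date string.
users["date_account_created"] = pd.to_datetime(users["date_account_created"])

# timestamp_first_active is an integer laid out as YYYYMMDDHHMMSS,
# so convert it to a string and parse it with an explicit format.
users["timestamp_first_active"] = pd.to_datetime(
    users["timestamp_first_active"].astype(str), format="%Y%m%d%H%M%S"
)

# Hypothetical derived feature: whole days between first activity and signup.
users["days_active_before_signup"] = (
    users["date_account_created"] - users["timestamp_first_active"].dt.normalize()
).dt.days

print(users)
```

Both sample users were active long before they created an account, which matches the note that a user can search before signing up.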
test_users.csv - the test set of users
sessions.csv - web sessions log for users
* user_id: to be joined with the column ‘id’ in the users table
* action: the user's action
* action_type: the type of the action
* action_detail: the details of the action
* device_type: the device type
* secs_elapsed: time spent, in seconds
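Because sessions.csv holds many rows per user while the users table holds one, a common approach is to aggregate the sessions per user first and then join on user_id = id. The sketch below uses made-up toy data (the ids `u1` to `u3` and the aggregate names `n_actions` and `total_secs` are assumptions for illustration), so it only shows the shape of the join, not the real features.

```python
import pandas as pd

# Toy stand-ins for the users table and sessions.csv (values are invented).
users = pd.DataFrame({"id": ["u1", "u2", "u3"]})
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "action": ["search", "show", "search"],
    "secs_elapsed": [10.0, 30.0, None],
})

# Collapse the session log to one row per user.
per_user = (
    sessions.groupby("user_id")
    .agg(n_actions=("action", "count"), total_secs=("secs_elapsed", "sum"))
    .reset_index()
)

# Left join keeps users without any session rows; their aggregates become NaN.
merged = users.merge(
    per_user, how="left", left_on="id", right_on="user_id"
).drop(columns="user_id")

print(merged)
```

A left join is used here so that users with no session records are kept rather than silently dropped.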
3. Exploratory Analysis and Feature Engineering
3.1 The train_users_2 and test_users files
Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pickle
import datetime
import os
Read the files
train = pd.read_csv("../input/train_users_2.csv")
test = pd.read_csv("../input/test_users.csv")
train.head()
| | id | date_account_created | timestamp_first_active | date_first_booking | gender | age | signup_method | signup_flow | language | affiliate_channel | affiliate_provider | first_affiliate_tracked | signup_app | first_device_type | first_browser | country_destination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gxn3p5htnn | 2010-06-28 | 20090319043255 | NaN | -unknown- | NaN | | 0 | en | direct | direct | untracked | Web | Mac Desktop | Chrome | NDF |
| 1 | 820tgsjxq7 | 2011-05-25 | 20090523174809 | NaN | MALE | 38.0 | | 0 | en | seo | | untracked | Web | Mac Desktop | Chrome | NDF |
| 2 | 4ft3gnwmtx | 2010-09-28 | 20090609231247 | 2010-08-02 | FEMALE | 56.0 | basic | 3 | en | direct | direct | untracked | Web | Windows Desktop | IE | US |
| 3 | bjjt8pjhuk | 2011-12-05 | 20091031060129 | 2012-09-08 | FEMALE | 42.0 | | 0 | en | direct | direct | untracked | Web | Mac Desktop | Firefox | other |
| 4 | 87mebub9p4 | 2010-09-14 | 20091208061105 | 2010-02-18 | -unknown- | 41.0 | basic | 0 | en | direct | direct | untracked | Web | Mac Desktop | Chrome | US |
List the features contained in the data
print("Column names for training dataset : ")
for column in train.columns:
    print("-", column)
Column names for training dataset :
- id
- date_account_created
- timestamp_first_active
- date_first_booking
- gender
- age
- signup_method
- signup_flow
- language
- affiliate_channel
- affiliate_provider
- first_affiliate_tracked
- signup_app
- first_device_type
- first_browser
- country_destination
Inspect the data summary
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213451 entries, 0 to 213450
Data columns (total 16 columns):
id 213451 non-null object
date_account_created 213451 non-null object
timestamp_first_active 213451 non-null int64
date_first_booking 88908 non-null object
gender 213451 non-null object
age 125461 non-null float64
signup_method 213451 non-null object
signup_flow 213451 non-null int64
language 213451 non-null object
affiliate_channel 213451 non-null object
affiliate_provider 213451 non-null object
first_affiliate_tracked 207386 non-null object
signup_app 213451 non-null object
first_device_type 213451 non-null object
first_browser 213451 non-null object
country_destination 213451 non-null object
dtypes: float64(1), int64(2), object(13)
memory usage: 26.1+ MB
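1. The train file contains 213,451 rows and 16 columns.
2. The summary shows each feature's data type and non-null count.
3. age has many missing values (only 125,461 of 213,451 rows are non-null, so about 41% are missing); during feature extraction, consider treating missing as a category of its own.
4. date_first_booking has even more missing values (only 88,908 rows are non-null, so about 58% are missing); consider dropping it during feature extraction.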
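Observations 3 and 4 could be turned into code along the following lines. This is only a sketch on a four-row toy frame: the bin edges, the bucket labels, and the column name `age_bucket` are assumptions chosen for illustration, not the binning used in the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the two columns with many missing values.
toy = pd.DataFrame({
    "age": [38.0, np.nan, 56.0, np.nan],
    "date_first_booking": [np.nan, np.nan, "2010-08-02", "2012-09-08"],
})

# Share of missing values per column (the real data gives about 0.41 and 0.58).
missing_share = toy.isna().mean()
print(missing_share)

# Hypothetical age bins; intervals are left-closed, e.g. [35, 45).
bins = [0, 25, 35, 45, 55, 65, 120]
labels = ["<25", "25-34", "35-44", "45-54", "55-64", "65+"]
age_bucket = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# Treat missing age as its own category instead of imputing a number.
# Ages outside the bins also come back as NaN from pd.cut and land here.
toy["age_bucket"] = age_bucket.cat.add_categories("missing").fillna("missing")

# Drop date_first_booking, following observation 4.
toy = toy.drop(columns="date_first_booking")
print(toy)
```

Keeping "missing" as a separate bucket lets a model learn from the fact that age was not provided, rather than hiding it behind an imputed value.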
Exploratory analysis
The date_account_created feature
dac_train = pd.to_datetime(train.date_account_created).value_counts()
dac_test = pd.to_datetime(test.date_account_created).