Predicting Airbnb New Users' First Booking Destination


littleadams



Tags: machine learning, data analysis

Article link: https://blog.csdn.net/weixin_40362097/article/details/81624243


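This post looks at predicting the country in which a new Airbnb user will make their first booking, using user demographics, web session records, and summary statistics. Exploratory data analysis and feature engineering (handling timestamps, missing values, and categorical data) prepare the data for building machine learning models.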


1. Background

About this dataset: In this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user’s first booking destination will be. All the users in this dataset are from the USA.

There are 12 possible outcomes of the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’. Please note that ‘NDF’ is different from ‘other’: ‘other’ means there was a booking, but to a country not included in the list, while ‘NDF’ means there wasn’t a booking.
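
Since most classifiers expect integer targets, the 12 classes above are usually mapped to ids before training. The following is only a minimal sketch of such a mapping; the names `CLASSES`, `encode`, and `decode` are illustrative and not taken from the original notebook, and the ordering of the classes is an arbitrary assumption.

```python
# The 12 destination classes listed in the dataset description.
# The order is arbitrary; it only has to stay fixed between training and prediction.
CLASSES = ["US", "FR", "CA", "GB", "ES", "IT", "PT", "NL", "DE", "AU", "NDF", "other"]

label_to_id = {label: i for i, label in enumerate(CLASSES)}
id_to_label = {i: label for label, i in label_to_id.items()}


def encode(labels):
    """Map destination labels to integer class ids."""
    return [label_to_id[label] for label in labels]


def decode(ids):
    """Map integer class ids back to destination labels."""
    return [id_to_label[i] for i in ids]


# country_destination values of the first five training rows.
sample = ["NDF", "NDF", "US", "other", "US"]
encoded = encode(sample)
print(encoded)
print(decode(encoded))
```

Keeping 'NDF' and 'other' as separate ids preserves the distinction between "no booking" and "booking to an unlisted country".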

2. Data Description

train_users_2.csv - the training set of users
* id: user id
* date_account_created: the date of account creation
* timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
* date_first_booking: date of first booking
* gender: the user's gender
* age: the user's age
* signup_method: the method used to sign up
* signup_flow: the page a user came to sign up from
* language: international language preference
* affiliate_channel: what kind of paid marketing
* affiliate_provider: where the marketing is, e.g. google, craigslist, other
* first_affiliate_tracked: the first marketing the user interacted with before signing up
* signup_app: the app used to sign up
* first_device_type: the type of the first device used
* first_browser: the type of the first browser used
* country_destination: the booking country; this is the target variable you are to predict
test_users.csv - the test set of users
sessions.csv - web sessions log for users
* user_id: to be joined with the column ‘id’ in users table
* action: the user action
* action_type: the type of the user action
* action_detail: details of the user action
* device_type: the device type
* secs_elapsed: time elapsed, in seconds
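
The description says that `user_id` in sessions.csv joins to `id` in the users table. One plausible way to use that, sketched here on made-up toy rows rather than the real files, is to aggregate the session log per user and left-join the result onto the users table. The aggregate names `n_actions` and `total_secs` are hypothetical and do not come from the original post.

```python
import pandas as pd

# Toy stand-ins for the users and sessions tables (all values are made up).
users = pd.DataFrame({"id": ["u1", "u2", "u3"], "age": [38.0, None, 56.0]})
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "action": ["search", "show", "search"],
    "secs_elapsed": [120.0, 30.0, 45.0],
})

# One row per user: number of logged actions and total elapsed seconds.
session_stats = (
    sessions.groupby("user_id")
    .agg(n_actions=("action", "count"), total_secs=("secs_elapsed", "sum"))
    .reset_index()
)

# Left join keeps users without any session rows; their aggregates become NaN.
merged = users.merge(
    session_stats, how="left", left_on="id", right_on="user_id"
).drop(columns="user_id")
print(merged)
```

A left join is used so that users with no session history stay in the table; how to fill their missing aggregates is a separate modelling decision.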

3. Exploratory Analysis and Feature Engineering

3.1 The train_users_2 and test_users files

Import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pickle
import datetime
import os

Read the files

train = pd.read_csv("../input/train_users_2.csv")
test = pd.read_csv("../input/test_users.csv")

train.head()

	id	date_account_created	timestamp_first_active	date_first_booking	gender	age	signup_method	signup_flow	language	affiliate_channel	affiliate_provider	first_affiliate_tracked	signup_app	first_device_type	first_browser	country_destination
0	gxn3p5htnn	2010-06-28	20090319043255	NaN	-unknown-	NaN	facebook	0	en	direct	direct	untracked	Web	Mac Desktop	Chrome	NDF
1	820tgsjxq7	2011-05-25	20090523174809	NaN	MALE	38.0	facebook	0	en	seo	google	untracked	Web	Mac Desktop	Chrome	NDF
2	4ft3gnwmtx	2010-09-28	20090609231247	2010-08-02	FEMALE	56.0	basic	3	en	direct	direct	untracked	Web	Windows Desktop	IE	US
3	bjjt8pjhuk	2011-12-05	20091031060129	2012-09-08	FEMALE	42.0	facebook	0	en	direct	direct	untracked	Web	Mac Desktop	Firefox	other
4	87mebub9p4	2010-09-14	20091208061105	2010-02-18	-unknown-	41.0	basic	0	en	direct	direct	untracked	Web	Mac Desktop	Chrome	US
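
In the preview above, `timestamp_first_active` is stored as an integer in YYYYMMDDHHMMSS form, while `date_account_created` is a plain date string. A minimal sketch of parsing both and deriving a few time features follows, using the first three rows of the preview; the derived column names (`tfa`, `dac`, `tfa_year`, `tfa_hour`, `days_to_signup`) are illustrative assumptions, not necessarily the features the original notebook builds.

```python
import pandas as pd

# First three rows of the two time columns shown in train.head().
sample = pd.DataFrame({
    "date_account_created": ["2010-06-28", "2011-05-25", "2010-09-28"],
    "timestamp_first_active": [20090319043255, 20090523174809, 20090609231247],
})

# Convert the integer to a string first so the explicit format can be applied.
sample["tfa"] = pd.to_datetime(
    sample["timestamp_first_active"].astype(str), format="%Y%m%d%H%M%S"
)
sample["dac"] = pd.to_datetime(sample["date_account_created"])

# Example derived features: calendar parts and the gap between first activity and signup.
sample["tfa_year"] = sample["tfa"].dt.year
sample["tfa_hour"] = sample["tfa"].dt.hour
sample["days_to_signup"] = (sample["dac"] - sample["tfa"].dt.normalize()).dt.days
print(sample[["tfa", "dac", "tfa_year", "tfa_hour", "days_to_signup"]])
```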

List the features contained in the data

print("Column names for training dataset : ")
for column in train.columns:
    print("-", column)

    Column names for training dataset : 
    - id
    - date_account_created
    - timestamp_first_active
    - date_first_booking
    - gender
    - age
    - signup_method
    - signup_flow
    - language
    - affiliate_channel
    - affiliate_provider
    - first_affiliate_tracked
    - signup_app
    - first_device_type
    - first_browser
    - country_destination

Inspect the data summary

train.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 213451 entries, 0 to 213450
    Data columns (total 16 columns):
    id                         213451 non-null object
    date_account_created       213451 non-null object
    timestamp_first_active     213451 non-null int64
    date_first_booking         88908 non-null object
    gender                     213451 non-null object
    age                        125461 non-null float64
    signup_method              213451 non-null object
    signup_flow                213451 non-null int64
    language                   213451 non-null object
    affiliate_channel          213451 non-null object
    affiliate_provider         213451 non-null object
    first_affiliate_tracked    207386 non-null object
    signup_app                 213451 non-null object
    first_device_type          213451 non-null object
    first_browser              213451 non-null object
    country_destination        213451 non-null object
    dtypes: float64(1), int64(2), object(13)
    memory usage: 26.1+ MB

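1. The train file contains 213451 rows and 16 columns.
2. The output shows the data type and non-null count of each feature.
3. age has many missing values (only 125461 of 213451 are non-null); during feature extraction, consider treating missing values as a separate category.
4. date_first_booking has many missing values (only 88908 of 213451 are non-null); consider dropping it during feature extraction.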
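
The two ideas in points 3 and 4 can be sketched as follows, using the age and date_first_booking values of the first five training rows. The bin edges, the bucket labels, and the "missing" category name are assumptions chosen for illustration; the original post does not specify them.

```python
import numpy as np
import pandas as pd

# age and date_first_booking of the first five training rows.
sample = pd.DataFrame({
    "age": [np.nan, 38.0, 56.0, 42.0, 41.0],
    "date_first_booking": [np.nan, np.nan, "2010-08-02", "2012-09-08", "2010-02-18"],
})

# Fraction of missing values per column.
print(sample.isnull().mean())

# Point 4: drop date_first_booking.
sample = sample.drop(columns=["date_first_booking"])

# Point 3: bucket age and keep missing values as their own category.
# Bin edges and labels are illustrative; ages outside (0, 120] also end up as "missing".
bins = [0, 25, 35, 45, 55, 65, 120]
labels = ["<=25", "26-35", "36-45", "46-55", "56-65", ">65"]
age_bucket = pd.cut(sample["age"], bins=bins, labels=labels)
sample["age_bucket"] = age_bucket.cat.add_categories(["missing"]).fillna("missing")

# One-hot encode the buckets, including the "missing" bucket.
age_dummies = pd.get_dummies(sample["age_bucket"], prefix="age")
print(sample)
print(age_dummies)
```

Treating a missing age as a category lets a model learn from the fact that a user did not report an age, instead of hiding it behind an imputed value.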

`Exploratory analysis`

The date_account_created feature

dac_train = pd.to_datetime(train.date_account_created).value_counts()
dac_test = pd.to_datetime(test.date_account_created).
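
The counting step used above, parsing the dates and calling `value_counts`, gives the number of accounts created per day. A self-contained sketch on a few made-up dates is shown here; the monthly aggregation is an extra illustration and is not taken from the original notebook.

```python
import pandas as pd

# Made-up account creation dates for illustration only.
dates = pd.Series(["2010-06-28", "2010-06-28", "2010-09-14", "2010-09-28"])
parsed = pd.to_datetime(dates)

# Accounts created per day, ordered chronologically.
dac_counts = parsed.value_counts().sort_index()
print(dac_counts)

# Accounts created per calendar month.
monthly_counts = parsed.dt.to_period("M").value_counts().sort_index()
print(monthly_counts)
```

Sorting by index matters because `value_counts` orders by frequency by default, which would scramble a time series plot.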
