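[Big Data Analytics] Hotel Public-Opinion Analysis and Modeling

Environment: PyCharm, Tableau

I. Data Preprocessing

Once the raw data has been collected, it needs to be cleaned and preprocessed before it can be used for visualization and modeling.

This case uses four datasets: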

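Dataset           Contents
hotel_urls_ids    Mapping from hotel URL to hotel identifier
hotel_profiles    Hotel information
hotel_reviews     Hotel reviews
geo_mappings      Province/city/district mapping for geographic locations

Table 1. Datasets used in Experiment 1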

Packages used:

import pandas as pd

import numpy as np

import jieba, re, graphviz

import jieba.posseg as pseg

import warnings

warnings.filterwarnings('ignore')

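1. Data import

Read the datasets hotel_urls_ids, hotel_profiles, hotel_reviews, and geo_mappings, and print the first few rows of each. The CSV files have no header row, so the column names are supplied through the names argument.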

hotel_urls_ids = pd.read_csv("D:/data/Qunar/hotel_urls_ids.csv", names = ["hotel_url", "hotel_id"])

print(hotel_urls_ids.head(5))
hotel_profiles = pd.read_csv("D:/data/Qunar/hotel_profiles.csv", names = ["hotel_url", "city", "name", "address", "score", "open_date", "room_count"])

print(hotel_profiles.head(5))
hotel_reviews = pd.read_csv("D:/data/Qunar/hotel_reviews.csv", names = ["hotel_id", "date", "title", "content", "score"])

print(hotel_reviews.head(5))
geo_mappings = pd.read_csv("D:/data/Qunar/geo_mappings.csv", names = ["province", "city", "county", "longitude", "lattitude"])

print(geo_mappings.head(5))
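To check how names behaves without touching the real files, the following minimal sketch reads a headerless CSV from memory. The two sample rows and their URLs are made up for illustration; only the column names come from the code above.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for hotel_urls_ids.csv (no header row).
sample_csv = io.StringIO(
    "http://example.com/hotel/a,1001\n"
    "http://example.com/hotel/b,1002\n"
)

# Passing names= labels the columns and treats the first line as data,
# which is the same as passing header=None explicitly.
sample = pd.read_csv(sample_csv, names=["hotel_url", "hotel_id"])

print(sample.shape)
print(list(sample.columns))
```

If a file did contain a header row, passing names alone would read that header as a data row, so it is worth looking at head() after loading.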

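2. Geographic data processing

Clean the province and city names by stripping administrative suffixes, then drop duplicate city-to-province pairs.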

geo_mappings['city'] = geo_mappings['city'].str.replace("市|地区|自治州", "",regex=True)

geo_mappings['province'] = geo_mappings['province'].str.replace("市|省|自治区|壮族|维吾尔|回族", "",regex=True)

geo_mappings = geo_mappings[['city','province']].drop_duplicates()
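The effect of the two replace calls and drop_duplicates can be seen on a small hand-made frame. This is only a sketch: the three rows below are illustrative and are not taken from geo_mappings.csv.

```python
import pandas as pd

# Hypothetical rows: two districts of the same city, plus a municipality.
geo = pd.DataFrame({
    "province": ["广西壮族自治区", "广西壮族自治区", "北京市"],
    "city": ["南宁市", "南宁市", "北京市"],
    "county": ["青秀区", "兴宁区", "朝阳区"],
})

# Same patterns as in the article: strip administrative suffixes.
geo["city"] = geo["city"].str.replace("市|地区|自治州", "", regex=True)
geo["province"] = geo["province"].str.replace(
    "市|省|自治区|壮族|维吾尔|回族", "", regex=True
)

# Dropping county leaves repeated (city, province) pairs; keep one of each.
city_province = geo[["city", "province"]].drop_duplicates().reset_index(drop=True)
print(city_province)
```

Note that these patterns are not anchored to the end of the string, so they remove a matching character wherever it appears. A name such as 津市市 would be reduced to 津. If that matters for the data at hand, anchoring the pattern with `$` is one possible adjustment.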

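Join the URL-to-identifier mapping (dataset hotel_urls_ids), the hotel information (dataset hotel_profiles), and the city-to-province mapping (dataset geo_mappings). The first join uses the hotel URL (variable hotel_url) and the second uses the city (variable city); because no key is passed, pd.merge joins on the columns the two frames have in common. Only the variables needed later are kept.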

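# drop(columns=...) replaces the positional axis argument, which pandas 2.0 no longer accepts
hotel_profiles = pd.merge(hotel_urls_ids, hotel_profiles).drop(columns='hotel_url')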

hotel_profiles = pd.merge(hotel_profiles, geo_mappings)

hotel_profiles = hotel_profiles[["hotel_id", "province", "city", "name", "score","open_date", "room_count"]]
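pd.merge defaults to an inner join, so hotels whose city has no match in the mapping are dropped without any message. The sketch below, on made-up frames, shows one way to find such rows with a left join and indicator=True. The frame contents are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the three datasets.
urls = pd.DataFrame({"hotel_url": ["u1", "u2", "u3"], "hotel_id": [1, 2, 3]})
profiles = pd.DataFrame({
    "hotel_url": ["u1", "u2", "u3"],
    "city": ["北京", "南宁", "未知城市"],
    "name": ["A", "B", "C"],
})
geo = pd.DataFrame({"city": ["北京", "南宁"], "province": ["北京", "广西"]})

# Step 1: attach hotel_id through the shared hotel_url column.
step1 = pd.merge(urls, profiles, on="hotel_url").drop(columns="hotel_url")

# Step 2: a left join keeps every hotel; _merge marks rows with no province.
step2 = pd.merge(step1, geo, on="city", how="left", indicator=True)
unmatched = step2.loc[step2["_merge"] == "left_only", "name"].tolist()
print(unmatched)

# The default inner join would have returned only the matched hotels.
inner_rows = len(pd.merge(step1, geo, on="city"))
print(inner_rows)
```

Naming the key with on= also guards against an accidental join on some other column that both frames happen to share.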

With the unused variables removed, clean the opening date (variable open_date) and room count (variable room_count) and convert them to numeric types.

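# Keep the first four characters of open_date (the year) and convert them to a number
hotel_profiles['open_date'] = pd.to_numeric(hotel_profiles['open_date'].str.slice(0,4), errors='coerce')

# Drop the trailing character of room_count and convert the rest to a number
hotel_profiles['room_count'] = pd.to_numeric(hotel_profiles['room_count'].str.slice(stop=-1), errors='coerce')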

print(hotel_profiles.head(5))
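String slicing on its own leaves the values as text. The sketch below shows the slicing followed by a numeric conversion on invented values. The raw formats used here (a year followed by text, a count followed by a one-character unit) are assumptions inferred from the slice positions, not taken from the data.

```python
import pandas as pd

# Hypothetical raw values; the real file's format may differ.
raw = pd.DataFrame({
    "open_date": ["2010年开业", "2015年开业", None],
    "room_count": ["120间", "86间", None],
})

# Slice first, then convert; errors="coerce" turns unparseable text into NaN
# instead of raising.
year = pd.to_numeric(raw["open_date"].str.slice(0, 4), errors="coerce")
rooms = pd.to_numeric(raw["room_count"].str.slice(stop=-1), errors="coerce")

print(year.tolist())
print(rooms.tolist())
```

Because missing values become NaN, both columns end up as floats. Counting the NaN values afterwards is a quick check on how many rows failed to parse.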

Save the result to a CSV file.

hotel_profiles.to_csv("out1.csv", index = False)

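Join the hotel information built above (dataset hotel_profiles) with the hotel reviews (dataset hotel_reviews) on the hotel identifier (variable hotel_id), and save the result to a CSV file. Both frames contain a score column, so pandas renames them score_x (the hotel's score) and score_y (the review's score); only the review score is kept.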

review_all = pd.merge(hotel_profiles, hotel_reviews, on = 'hotel_id')[["hotel_id", "province", "city", "date", "score_y"]]

review_all.to_csv("out2.csv", index = False)
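The score_y column comes from the default suffixes that pd.merge adds when both frames share a non-key column name. As a sketch on invented frames, the suffixes argument can give these columns more readable names. The names _hotel and _review are illustrative choices, not part of the original code.

```python
import pandas as pd

# Hypothetical miniature frames; both carry a "score" column.
profiles = pd.DataFrame({
    "hotel_id": [1, 2],
    "city": ["北京", "南宁"],
    "score": [4.5, 4.0],
})
reviews = pd.DataFrame({
    "hotel_id": [1, 1, 2],
    "date": ["2019-01-01", "2019-01-02", "2019-01-03"],
    "score": [5, 3, 4],
})

# Default behaviour: the clashing columns become score_x and score_y.
default = pd.merge(profiles, reviews, on="hotel_id")
print(list(default.columns))

# Explicit suffixes make the origin of each score obvious.
named = pd.merge(profiles, reviews, on="hotel_id", suffixes=("_hotel", "_review"))
print(list(named.columns))
```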

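3. Chinese word segmentation and keyword counting

jieba can split a review into Chinese words and tag each word with its part of speech. Group all reviews by score (1 to 5), segment and tag the reviews in each group, and keep only nouns and verbs (tags n, v, vd, and vn). The result is a data frame containing the score (variable score), the word (variable word), and the word frequency (variable count).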

# Preview the word/tag pairs for the first review; print is needed when running as a script
print([str(i) for i in pseg.cut(hotel_reviews.loc[0,'content'])])

score_word_count = pd.DataFrame({'score' : [], 'word' : [], 'count' : []})

for i in range(5):
    # All reviews with score i+1, as strings
    contents = hotel_reviews.loc[hotel_reviews['score'] == i+1, 'content'].astype(str)
    # Segment each review and keep only nouns and verbs
    words = [word for content in contents for word, flag in list(pseg.cut(content)) if flag in ["n","v","vd","vn"]]
    word_count = pd.Series(words).value_counts()
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    score_word_count = pd.concat([score_word_count, pd.DataFrame(
        {'score' : [i+1] * len(word_count), 'word' : word_count.index, 'count' : word_count}).reset_index(drop = True)], ignore_index = True, sort = False)

# Cast score and count to integers, then save the result to a CSV file.

score_word_count['score'] = score_word_count['score'].astype('int32')

score_word_count['count'] = score_word_count['count'].astype('int32')

print(score_word_count.head(5))

score_word_count.to_csv("out3.csv", index = False)
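The filter-and-count step inside the loop can be tried without jieba or the review data. The sketch below uses only the standard library and takes reviews that are already split into (word, tag) pairs. The helper name count_keywords and the hand-written tags are hypothetical; they are not actual jieba output.

```python
from collections import Counter

# Part-of-speech tags to keep, as in the article: nouns and verbs.
KEEP_FLAGS = {"n", "v", "vd", "vn"}


def count_keywords(tagged_reviews):
    """Count words whose tag is in KEEP_FLAGS across all reviews."""
    counts = Counter()
    for tagged in tagged_reviews:
        counts.update(word for word, flag in tagged if flag in KEEP_FLAGS)
    return counts


# Hand-written (word, tag) pairs standing in for pseg.cut output.
tagged_reviews = [
    [("房间", "n"), ("很", "d"), ("干净", "a")],
    [("房间", "n"), ("推荐", "v"), ("的", "uj")],
]

counts = count_keywords(tagged_reviews)
print(counts.most_common(2))
```

Calling this once per score group would give the same per-score word frequencies that the loop collects into score_word_count.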