Environment: PyCharm, Tableau
I. Data Preprocessing
After obtaining the data, some cleaning and preprocessing is needed before further visualization and modeling.
This case involves four datasets:
Dataset | Contents
hotel_urls_ids | Mapping between hotel URLs and hotel IDs
hotel_profiles | Hotel profile information
hotel_reviews | Hotel reviews
geo_mappings | Province / city / county mapping for geographic locations
Table 1  Datasets for Experiment 1
Packages used:
import pandas as pd
import numpy as np
import jieba, re, graphviz
import jieba.posseg as pseg
import warnings
warnings.filterwarnings('ignore')
1. Data Import
Read the datasets hotel_urls_ids, hotel_profiles, hotel_reviews, and geo_mappings, and inspect the first few rows of each.
hotel_urls_ids = pd.read_csv("D:/data/Qunar/hotel_urls_ids.csv", names = ["hotel_url", "hotel_id"])
print(hotel_urls_ids.head(5))
hotel_profiles = pd.read_csv("D:/data/Qunar/hotel_profiles.csv", names = ["hotel_url", "city", "name", "address", "score", "open_date", "room_count"])
print(hotel_profiles.head(5))
hotel_reviews = pd.read_csv("D:/data/Qunar/hotel_reviews.csv", names = ["hotel_id", "date", "title", "content", "score"])
print(hotel_reviews.head(5))
geo_mappings = pd.read_csv("D:/data/Qunar/geo_mappings.csv", names = ["province", "city", "county", "longitude", "latitude"])
print(geo_mappings.head(5))
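Before any cleaning, it can help to confirm that the four files were read as expected. A minimal sanity check, not part of the original script, that prints the shape and total number of missing values for each table:
# Optional sanity check (not in the original workflow): shapes and missing-value counts
for name, df in [("hotel_urls_ids", hotel_urls_ids), ("hotel_profiles", hotel_profiles),
                 ("hotel_reviews", hotel_reviews), ("geo_mappings", geo_mappings)]:
    print(name, df.shape, df.isna().sum().sum())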
2. Geographic Data Processing
Clean the province and city names, then remove duplicate province-city pairs from the mapping.
geo_mappings['city'] = geo_mappings['city'].str.replace("市|地区|自治州", "",regex=True)
geo_mappings['province'] = geo_mappings['province'].str.replace("市|省|自治区|壮族|维吾尔|回族", "",regex=True)
geo_mappings = geo_mappings[['city','province']].drop_duplicates()
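The two replacements strip administrative suffixes so that, presumably, the names line up with the city and province fields of hotel_profiles in the merge below. A hypothetical check on a few sample names, not part of the original script:
# Hypothetical check of the cleaning rule on sample province names
samples = pd.Series(["广西壮族自治区", "新疆维吾尔自治区", "北京市"])
print(samples.str.replace("市|省|自治区|壮族|维吾尔|回族", "", regex=True))  # 广西, 新疆, 北京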
Join the hotel URL-to-ID mapping (hotel_urls_ids) with the hotel information (hotel_profiles) on the hotel URL (hotel_url), then join the result with the province/city mapping (geo_mappings) on the city name (city).
hotel_profiles = pd.merge(hotel_urls_ids, hotel_profiles).drop(columns='hotel_url')
hotel_profiles = pd.merge(hotel_profiles, geo_mappings)
hotel_profiles = hotel_profiles[["hotel_id", "province", "city", "name", "score","open_date", "room_count"]]
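pd.merge called without an on= argument joins on the columns the two frames share, so the first merge above keys on hotel_url and the second on city. Spelled out explicitly, the same joins would read as follows (a sketch with identical behavior, shown commented out so the merges are not run twice):
# Equivalent merges with explicit join keys:
# hotel_profiles = pd.merge(hotel_urls_ids, hotel_profiles, on="hotel_url").drop(columns="hotel_url")
# hotel_profiles = pd.merge(hotel_profiles, geo_mappings, on="city")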
Drop the columns that are not needed, then clean the opening date (open_date) and room count (room_count) so they can be converted to numeric types.
hotel_profiles['open_date'] = hotel_profiles['open_date'].str.slice(0, 4)        # keep the first four characters (the year)
hotel_profiles['room_count'] = hotel_profiles['room_count'].str.slice(stop=-1)   # drop the last character
print(hotel_profiles.head(5))
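The slicing above still leaves open_date and room_count as strings. A minimal sketch of the numeric conversion mentioned in the text, assuming pd.to_numeric with errors='coerce' is acceptable for malformed values:
# Convert the cleaned strings to numbers; unparsable entries become NaN (assumption)
hotel_profiles['open_date'] = pd.to_numeric(hotel_profiles['open_date'], errors='coerce')
hotel_profiles['room_count'] = pd.to_numeric(hotel_profiles['room_count'], errors='coerce')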
Save the result to a CSV file.
hotel_profiles.to_csv("out1.csv", index = False)
Join the hotel information obtained above (hotel_profiles) with the hotel reviews (hotel_reviews) on the hotel ID (hotel_id), and save the result to a CSV file.
review_all = pd.merge(hotel_profiles, hotel_reviews, on = 'hotel_id')[["hotel_id", "province", "city", "date", "score_y"]]
review_all.to_csv("out2.csv", index = False)
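Because both input tables contain a score column, the review score is suffixed as score_y after the merge. If a clearer column name is preferred in out2.csv, a hypothetical rename before saving:
# Hypothetical: rename the suffixed review-score column for readability
# review_all = review_all.rename(columns={'score_y': 'review_score'})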
3. Chinese Word Segmentation and Keyword Counting
Jieba can segment the reviews into Chinese words and tag each word with its part of speech. Group all reviews by rating (1 to 5), segment and POS-tag the reviews in each group, and keep only nouns and verbs (POS tags n, v, vd, and vn). The result is a data frame containing the rating (score), the word (word), and the word frequency (count).
# Preview: segment and POS-tag the first review; each item prints as word/flag
print([str(i) for i in pseg.cut(str(hotel_reviews.loc[0, 'content']))])
score_word_count = pd.DataFrame({'score': [], 'word': [], 'count': []})
for i in range(5):
    # reviews with rating i + 1, as strings
    contents = hotel_reviews.loc[hotel_reviews['score'] == i + 1, 'content'].astype(str)
    # segment every review and keep only nouns and verbs (POS tags n, v, vd, vn)
    words = [word for content in contents for word, flag in pseg.cut(content) if flag in ["n", "v", "vd", "vn"]]
    word_count = pd.Series(words).value_counts()
    # DataFrame.append was removed in recent pandas versions; pd.concat is the equivalent
    score_word_count = pd.concat([score_word_count, pd.DataFrame(
        {'score': [i + 1] * len(word_count), 'word': word_count.index,
         'count': word_count.values})], ignore_index=True)
# Convert score and count back to integers and save the result to a CSV file.
score_word_count['score'] = score_word_count['score'].astype('int32')
score_word_count['count'] = score_word_count['count'].astype('int32')
print(score_word_count.head(5))
score_word_count.to_csv("out3.csv", index = False)
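As a quick check of the result, the most frequent words for each rating can be inspected before loading out3.csv into Tableau. A minimal sketch, not part of the original script:
# Optional check: the three most frequent kept words for each rating
top_words = (score_word_count
             .sort_values(['score', 'count'], ascending=[True, False])
             .groupby('score')
             .head(3))
print(top_words)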