Data Analysis: Data Cleaning (Part 4)

Tourism Job-Posting Data Analysis: Data Cleaning (Part 4)

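Once the data has been collected, the next step is cleaning it. This is tedious and labor-intensive. The first thing to do is remove duplicates: the data comes from several sites, so many postings appear more than once, and those rows are not only useless but also add work and waste time. It is best to merge all the data into a single Excel file first.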

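import pandas as pd
# Read the merged workbook; read_excel already returns a DataFrame.
data = pd.read_excel('数据大集成.xlsx', sheet_name='Sheet1')
# Drop rows that are identical in every column.
no_re_row = data.drop_duplicates()
print(no_re_row)
# Save as .xlsx: pandas 2.0 and later can no longer write the legacy .xls format.
no_re_row.to_excel("新(数据大集成).xlsx")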
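
As a rough sketch of what the de-duplication step does, the snippet below uses a small in-memory table instead of the Excel file. The column names (`company`, `title`, `city`) and the rows are made up for illustration and are not the schema of the real data set.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from 数据大集成.xlsx.
data = pd.DataFrame({
    "company": ["A Travel", "A Travel", "B Tours", "B Tours"],
    "title":   ["Tour guide", "Tour guide", "Sales", "Sales"],
    "city":    ["Guangzhou", "Guangzhou", "Shenzhen", "Beijing"],
})

# With no arguments, drop_duplicates() removes rows that match in every column.
no_re_row = data.drop_duplicates()

# subset= limits the comparison to the listed columns; keep="first" keeps the
# first row of each duplicate group.
by_company_title = data.drop_duplicates(subset=["company", "title"], keep="first")

removed = len(data) - len(no_re_row)
print(f"{removed} exact duplicate row(s) removed, {len(no_re_row)} left")
print(f"{len(by_company_title)} row(s) left when comparing company and title only")
```

Whether to compare whole rows or only a few columns depends on how the same posting differs between sites, so it is worth checking the removed count before overwriting anything.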

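Next, open the de-duplicated file and copy its contents out a portion at a time. The data set is large, and a plain-text file becomes unreliable once it grows too big (it seems that errors start to appear beyond roughly 3-4 MB), so it is safer to copy in steps. I copied the Excel sheet one column at a time into a text file and then wrote a program to strip out the unneeded text. It is best to run the code one part at a time; running every replacement in a single pass makes the rules conflict with each other and corrupts the data.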

def clearBlankLine():
    # Copies 整理文件.txt to 整理好的内容.txt, applying whichever cleaning stage is
    # uncommented below. Uncomment one stage at a time: the stages overlap and
    # corrupt each other's output if they all run in a single pass.
    with open('整理文件.txt', 'r', encoding='utf-8') as file1, \
            open('整理好的内容.txt', 'w', encoding='utf-8') as file2:
        for line in file1:
            # Stage: strip noise words from job titles
            # line = line.replace("英语","").replace("薪聘","").replace("高","").replace("顺德区","").replace("旅游销售\\\\","") \
            #     .replace("去哪儿", "").replace("网","").replace("旅游在线","").replace("客服","").replace("门店","").replace("全球","") \
            #     .replace("包住宿", "").replace("月薪","").replace("携程","").replace("8500","").replace("8K","").replace("旅游产品专员\\\\","") \
            #     .replace("急聘", "").replace("康养","").replace("7000+","").replace("酒店","").replace("薪","").replace("招聘","") \
            #     .replace("同业", "").replace("无经验","").replace("+","").replace("轻松","").replace("工作","").replace("旅行海外","") \
            #     .replace("同业", "").replace("无经验", "").replace("+", "").replace("轻松", "").replace("工作", "").replace("旅行海外", "") \
            #     .replace("泰语", "").replace("客服", "").replace("主管", "").replace("门店", "").replace("扶贫", "").replace("旅行海外", "") \
            #     .replace("高铁", "").replace("资深", "").replace(",", "").replace("轻松", "").replace("工作", "").replace("旅行海外", "") \
            #     .replace("同业", "").replace("无经验", "").replace("+", "").replace("轻松", "").replace("工作", "").replace("旅行海外", "")
            # Stage: turn decimal fragments into whole digits
            # line = line.replace("0.1","1").replace("0.2","2").replace("0.3","3").replace("0.4","4") \
            #     .replace("0.5", "5").replace("0.6","6").replace("0.7","7").replace("0.8","8").replace("0.9","9")
            # Stage: remove quotes, brackets, symbols and escaped whitespace
            # line = line.replace("'","").replace("\\xa0","").replace("***","").\
            #     replace(",","").replace("★","").replace("◆","").replace("(","").\
            #     replace(")","").replace("【","").replace("】","").replace("\\n","").\
            #     replace('[','').replace(']',"").replace("...","").replace('\\\\',"/")
            # line = line.strip(" ")
            # Stage: rewrite yearly and daily salaries as monthly "k" values
            # line = line.replace("1.20k","12k").replace("1.40k","14k").\
            #     replace("1.50k","15k").replace("1.30k","13k").replace("10-150k/年","10-15k").\
            #     replace("8-150k/年","8-15k").replace("15-200k/年","15-20k").\
            #     replace("200元/天","6k").replace("1.5千以下","1.5k").\
            #     replace("300元/天","9k").replace("100元/天","3k").replace("150元/天","4.5k").\
            #     replace("7-120k/年","7-12k").replace("30-400k/年","3-4k").replace("20-300k/年","2-30k").replace("8-100k/年","8-10k").replace("8-200k/年","8-20k")
            # Stage: add the missing "k" to the lower bound of a range
            # line = line.replace("1-","1k-").replace("2-","2k-").replace("3-","3k-").\
            #     replace("6-","6k-").replace("5-","5k-").replace("4-","4k-").replace("7-","7k-").replace("8-","8k-").replace("9-","9k-").replace("0-","0k-")
            # Stage: keep the first two characters of a place name, then restore the
            # longer names that were cut short (the slice also drops the newline)
            # line = line[0:2]
            # line = line.replace("哈尔","哈尔滨").replace("大兴","大兴安岭").replace("防城","防城港").replace("呼和","呼和浩特").\
            #     replace("呼伦","呼伦贝尔").replace("葫芦","葫芦岛").replace("红河","红河州").replace("景德","景德镇").replace("克拉","克拉玛依")\
            #     .replace("喀什","喀什地区").replace("马鞍","马鞍山").replace("牡丹","牡丹江").replace("秦皇","秦皇岛").replace("齐齐","齐齐哈尔").\
            #     replace("七台","七台河").replace("黔东","黔东南").replace("石家","石家庄").replace("神农","神农架").replace("双鸭","双鸭山")\
            #     .replace("石河","石河子").replace("图木","图木舒克").replace("五指","五指山").replace("乌鲁","乌鲁木齐").replace("西双","西双版纳")\
            #     .replace("张家","张家界").replace("驻马","驻马店")
            # Stage: drop the "k" suffix
            # line = line.replace("0k","0").replace("1k","1").replace("2k","2").replace("3k","3").replace("4k","4")\
            #     .replace("5k","5").replace("6k","6").replace("7k","7").replace("8k","8").replace("9k","9")
            # Stage: label single-digit ranges as 千/月 (thousand yuan per month)
            # line = line.replace("1-1","1-1千/月").replace("1-2","1-2千/月").replace("1-3","1-3千/月").replace("1-4","1-4千/月").\
            #     replace("1-5","1-5千/月").replace("1-6","1-6千/月").replace("1-7","1-7千/月").replace("1-8","1-8千/月").\
            #     replace("1-9","1-9千/月").replace("2-3","2-3千/月").replace("2-4","2-4千/月").\
            #     replace("2-5","2-5千/月").replace("2-6","2-6千/月").replace("2-7","2-7千/月").replace("2-8","2-8千/月").\
            #     replace("2-9","2-9千/月").replace("3-4","3-4千/月").\
            #     replace("3-5","3-5千/月").replace("3-6","3-6千/月").replace("3-7","3-7千/月").replace("3-8","3-8千/月").\
            #     replace("3-9","3-9千/月").\
            #     replace("5-5","5-5千/月").replace("5-6","5-6千/月").replace("5-7","5-7千/月").replace("5-8","5-8千/月").\
            #     replace("5-9","5-9千/月").replace("6-6","6-6千/月").replace("6-7","6-7千/月").replace("6-8","6-8千/月").\
            #     replace("6-9","6-9千/月").replace("7-7","7-7千/月").replace("7-8","7-8千/月").\
            #     replace("7-9","7-9千/月").replace("8-8","8-8千/月").\
            #     replace("8-9","8-9千/月").\
            #     replace("4-5","4-5千/月").replace("4-6","4-6千/月").replace("4-7","4-7千/月").replace("4-8","4-8千/月").\
            #     replace("4-9","4-9千/月")
            # Stage: label two-digit upper bounds as 万/月 (ten thousand yuan per month)
            # line = line.replace("20-4万/月0","2-4万/月")\
            #     .replace("-35","-3.5万/月").replace("-38","-3.8万/月").replace("-55","-5.5万/月").replace("-25","-2.5万/月").\
            #     replace("-36","-3.6万/月").replace("-26","-2.6万/月").replace("-27","-2.7万/月").replace("-28","-2.8万/月").\
            #     replace("-29","-2.9万/月").replace("-24","-2.4万/月").replace("-23","-2.3万/月").replace("-22","-2.2万/月").\
            #     replace("-21","-2.1万/月").replace("-31","-3.1万/月").replace("-32","-3.2万/月").replace("-33","-3.3万/月").\
            #     replace("-34","-3.4万/月").replace("-37","-3.7万/月").replace("-39","-3.9万/月").replace("-50","-5万/月").\
            #     replace("-60","-6万/月")
            # line = line.replace("-10","-1万/月").replace("-11","-1.1万/月").replace("-12","-1.2万/月").replace("-13","-1.3万/月").\
            #     replace("-14","-1.4万/月").replace("-15","-1.5万/月").replace("-16","-1.6万/月").replace("-17","-1.7万/月").replace("-18","-1.8万/月").\
            #     replace("-19","-1.9万/月").replace("-20","-2万/月").replace("-30","-3万/月").replace("-40","-4万/月")
            # Stage: repair unit labels that earlier stages doubled up
            # line = line.replace("2千/月.2万/月","2.2万/月").replace("千/月万/月","万/月").\
            #     replace("1千/月.2万/月","1.2万/月").replace("1千/月.5万/月","1.5万/月").replace("3千/月.5万/月","3.5万/月")
            # Write the line only after the active stage has modified it.
            file2.write(line)


if __name__ == '__main__':
    clearBlankLine()
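
The replace chains above are order-sensitive, which is why they have to be run one at a time. One possible alternative, shown here only as a sketch and not as the method used in this post, is to parse each salary string with a regular expression and convert it to thousand yuan per month. The function name `parse_salary`, the set of formats it accepts, and the sample strings are assumptions for illustration; the 30-days-per-month factor follows the `200元/天` to `6k` mapping in the chain above.

```python
import re

# Multiplier that converts an amount in the given unit to yuan.
UNIT_TO_YUAN = {"千": 1_000, "万": 10_000, "元": 1}
# Multiplier that converts a per-period amount to a per-month amount.
# Assumption: 30 paid days per month, as in the 200元/天 -> 6k mapping above.
PERIOD_TO_MONTH = {"月": 1, "年": 1 / 12, "天": 30}

SALARY_RE = re.compile(
    r"^(?P<low>\d+(?:\.\d+)?)"          # lower bound
    r"(?:-(?P<high>\d+(?:\.\d+)?))?"    # optional upper bound
    r"(?P<unit>[千万元])/(?P<period>[月年天])$"
)


def parse_salary(text):
    """Return (low, high) in thousand yuan per month, or None if unrecognised."""
    match = SALARY_RE.match(text.strip())
    if match is None:
        return None
    scale = UNIT_TO_YUAN[match["unit"]] * PERIOD_TO_MONTH[match["period"]] / 1_000
    low = float(match["low"]) * scale
    # A single value such as 200元/天 is treated as a range of zero width.
    high = float(match["high"]) * scale if match["high"] else low
    return round(low, 2), round(high, 2)


for sample in ["6-8千/月", "1-1.5万/月", "10-15万/年", "200元/天", "面议"]:
    print(sample, "->", parse_salary(sample))
```

Strings that do not fit the pattern, such as `面议` or `1.5千以下`, come back as `None`, so they can be collected and handled by hand instead of being silently mangled.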

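Once the data is cleaned, the analysis can be done with a dedicated data analysis tool. Python would also work and gives more precise control, but I was lazy and went straight to a tool, in this case Tableau Public. I chose it because it is easy to pick up and its charts are clear and easy to read. What I wanted to compute over the whole tourism data set was the percentage of each company type, the percentage of each work location, and the percentage of each company size.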
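
For readers who would rather stay in Python, the same kind of percentage breakdown can be sketched with pandas. The column names (`company_type`, `city`), the helper `share_percent`, and the rows below are invented for illustration; the real cleaned data will have its own columns.

```python
import pandas as pd

# Hypothetical cleaned rows; column names are assumptions, not the real schema.
jobs = pd.DataFrame({
    "company_type": ["Private", "Private", "State-owned", "Private", "Foreign"],
    "city": ["Guangzhou", "Beijing", "Beijing", "Shenzhen", "Beijing"],
})


def share_percent(frame, column):
    """Percentage of rows per distinct value in `column`, largest first."""
    return (frame[column].value_counts(normalize=True) * 100).round(1)


company_share = share_percent(jobs, "company_type")
city_share = share_percent(jobs, "city")
print(company_share)
print(city_share)
```

If matplotlib is installed, `company_share.plot.pie()` should give a pie chart comparable to the one a charting tool produces.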

Hovering the mouse over a slice of the pie chart shows that item's share, which makes the analysis convenient.

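That is the whole approach. The source code can be downloaded from my GitHub account. If it helped you and you don't mind the trouble, please give the repository a star on GitHub; your support is what keeps me updating.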

Data analysis results and code

Data Analysis: 51job (Part 1)

Data Analysis: Dajie (Part 2)

Data Analysis: Lagou (Part 3)

Data Analysis: Data Cleaning (Part 4)
