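First, a look at the data we are working with today, the results of the scrape:
Job postings: a little over 7,000 records
Rental listings: roughly 60,000 records
Tasks for this chapter:
1. Deduplicate the work addresses and get the coordinates of each address (via Gaode / 高德)
2. Deduplicate the companies and look up information on each company
3. Filter the jobs. Many of the postings scraped from Zhilian (智联) only mention python in the job description, for example for writing the occasional script, while the title is actually looking for Java or PHP. Those postings need to be removed
4. Remove anomalous values: values that are too large or too small, and NA values
5. Score each job on the role, the company, the job description, the skill requirements and so on, with the goal of finding the jobs that suit me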
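As a rough illustration of tasks 1 and 4, here is a minimal pandas sketch on made-up rows. The column names `address` and `salary` and the salary bounds are assumptions for the example, not fields or thresholds taken from the scraped data.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; "address" and "salary" are assumed column names.
jobs = pd.DataFrame({
    "address": ["1 A Road", "1 A Road", "2 B Road", "3 C Road", "4 D Road"],
    "salary": [12000, 15000, np.nan, 900000, 10000],
})

# Task 1: collect each distinct address once, so every address is geocoded only once.
unique_addresses = jobs["address"].drop_duplicates().tolist()

# Task 4: drop rows with a missing salary, then keep salaries inside a
# hand-picked plausible range (the bounds here are arbitrary).
cleaned = jobs.dropna(subset=["salary"])
cleaned = cleaned[cleaned["salary"].between(3000, 100000)]

print(unique_addresses)
print(cleaned)
```

Deduplicating before geocoding keeps the number of map API requests down to one per distinct address rather than one per posting.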
Let's get started:
First, import the data and take a look
import pandas as pd
import numpy as np
import pymongo
client = pymongo.MongoClient("mongodb://XX:XXXXX@192.168.3.7:2018",connect=False)
db = client["test"]
table = db["python"]
df = pd.DataFrame(list(table.find()))
del df["_id"]
df.head()
It looks like this:
Based on the job title, drop the jobs that don't suit me: anything with java, php, web, C, C++ and the like
name_ban = ["linux","php","Linux","PHP","JAVA","java","Java","DBA","运维","web","WEB","实习生","C","C++","培训","R","Golang"]
That is roughly the list. Use apply to replace every title that contains one of these keywords with NaN, then drop those rows.
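The replace-then-drop pattern can be tried on a handful of titles first. This is only a sketch: the titles and the shortened `banned` list are made up, and `mask_banned` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up titles for illustration only (not taken from the scraped data).
toy = pd.DataFrame({
    "job_name": ["Python开发工程师", "PHP工程师", "数据分析师", "Java/Python后端"],
})

banned = ["PHP", "Java"]  # shortened, hypothetical ban list


def mask_banned(title):
    # Titles that mention Python are always kept.
    if "Python" in title or "python" in title:
        return title
    # Otherwise a banned keyword turns the title into NaN.
    return np.nan if any(word in title for word in banned) else title


# Step 1: replace unsuitable titles with NaN.
toy["job_name"] = toy["job_name"].apply(mask_banned)
# Step 2: drop the rows whose title is now NaN.
toy = toy.dropna(subset=["job_name"]).reset_index(drop=True)

print(toy["job_name"].tolist())
# ['Python开发工程师', '数据分析师', 'Java/Python后端']
```

Only "PHP工程师" is removed: "Java/Python后端" contains a banned keyword but also mentions Python, so it stays.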
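def ban_name(job_name):
    # Keywords that mark a title as unsuitable. Matching is by plain substring,
    # so short entries such as "C" and "R" match any title containing that letter.
    name_ban = ["linux","php","Linux","PHP","JAVA","java","Java","DBA","运维","web","WEB","实习生","C","C++","培训","R","Golang"]
    # Titles that mention Python are always kept.
    if any(x in job_name for x in ["python","Python"]):
        pass
    else:
        # Otherwise a banned keyword turns the title into NaN so the row can be dropped.
        if any(x in job_name for x in name_ban):
            job_name = np.nan
    return job_name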
df["job_name"] = df["job_name&