Simple Elasticsearch in Practice (1): Introduction
Simple Elasticsearch in Practice (2): Scraping job-site postings with Python
Simple Elasticsearch in Practice (3): Connecting to Elasticsearch from Python
Simple Elasticsearch in Practice (4): Cleaning the data and importing it from MySQL into Elasticsearch
Simple Elasticsearch in Practice (5): Simple data analysis with Kibana
This post is only a small worked example. Importing a modest amount of data from MySQL into ES this way is fine, but for large volumes it is better to use a dedicated tool, or at least to optimize this approach, for example by switching to multiple threads.
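Another easy optimization is to batch the writes instead of calling job.save() once per document. Below is a minimal sketch of what that could look like with the bulk helper from the official elasticsearch package; the bulk_save name and the jobs iterable are stand-ins for the code shown later in this post, not part of the original project.

from elasticsearch.helpers import bulk
from elasticsearch_dsl import connections

def bulk_save(jobs):
    # jobs: any iterable of elasticsearch_dsl Document instances (hypothetical here).
    # to_dict(include_meta=True) produces the action dict the bulk helper expects,
    # so thousands of documents go out in a few HTTP requests instead of one each.
    es = connections.get_connection()
    return bulk(es, (job.to_dict(include_meta=True) for job in jobs))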
Data cleaning
First, a look at the data we scraped earlier: it is quite messy, with lots of empty and incomplete values, so it needs a second round of processing.
First, I add a function to Mysql.py.
Here we use pymysql's SSCursor (a server-side, unbuffered cursor) to fetch the data. The benefit is that it does not read the whole result set into memory at once, which matters when the table holds a lot of rows.
import pymysql

from common.Logger import Logger, DBLog

log = DBLog().getLog()


class DbManager:
    conn = pymysql.connect(host="106.13.114.199", user="root", passwd="Woai1998!", db="job51")
    if conn is not None:
        log.logger.info("successful connect mysql " + "106.13.114.199")

    # id is the first id and to is the last id; together they select a range of rows
    def selectWantedCursor(self, id, to):
        sql = "SELECT * FROM `job51`.`wanted` WHERE id >= %s AND id <= %s"
        # SSCursor streams rows from the server instead of buffering them all client-side
        cursor = pymysql.cursors.SSCursor(self.conn)
        cursor.execute(sql, (id, to))
        return cursor
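As a quick aside, an SSCursor can also be consumed directly as an iterator, which is handy for ad-hoc checks. A hypothetical example:

dbManager = DbManager()
cursor = dbManager.selectWantedCursor(0, 100)
for row in cursor:          # rows stream in one at a time
    print(row[0], row[1])   # id and job name, per the column order used below
cursor.close()              # an unbuffered cursor should be read to the end or closed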
Main method
from bs4 import BeautifulSoup
from elasticsearch_dsl import connections

from db.Mysql import DbManager
from common.Logger import ControllerLog, DataLog, Logger
from elasticsearchTest import Company, Address, Job

log = ControllerLog().getLog()
dbManager = DbManager()

if __name__ == "__main__":
    # register the default Elasticsearch connection used by Document.save()
    connections.create_connection(hosts=['106.13.114.199'])
    start(0, 9999999)
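The Company, Address, and Job classes imported from elasticsearchTest were set up in part (3) of this series. For reference, here is a rough sketch of what such mappings could look like in elasticsearch_dsl; the field types and index name below are my guesses based on how the fields are used later, not the original definitions.

from elasticsearch_dsl import Document, InnerDoc, Object, Keyword, Text

class Address(InnerDoc):
    city = Keyword()
    area = Keyword()

class Company(InnerDoc):
    company_name = Text()
    company_property = Keyword()
    number_person = Keyword()
    company_tag = Keyword(multi=True)

class Job(Document):
    url = Keyword()
    job_name = Text()
    company = Object(Company)
    job_description = Text(multi=True)
    work_experience = Keyword()
    welfare = Keyword(multi=True)
    education = Keyword()
    job_type = Keyword()
    work_address = Object(Address)
    salary = Keyword()
    date = Keyword()

    class Index:
        name = 'job51'   # hypothetical index name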
The start function
This function pulls job postings from the database one row at a time, cleans each row with parseRow, which returns a Job object, and finally calls job.save() to write it into Elasticsearch.
def start(id, to):
    log.logger.info("start loader to es")
    log.logger.info("start id is " + str(id))
    # get a server-side cursor over the requested id range
    cursor = dbManager.selectWantedCursor(id=id, to=to)
    row = cursor.fetchone()
    # walk through the result set row by row
    while row:
        # if this field is empty, the posting has most likely expired,
        # so skip it and move on to the next row
        if not row[11]:
            log.logger.warn("posting " + str(row[0]) + " has expired")
            row = cursor.fetchone()
            continue
        log.logger.info("current is parse " + str(row[0]))
        job = parseRow(row)   # clean up the job information
        stat = job.save()     # write to Elasticsearch; returns the result status
        if "create" not in stat:
            log.logger.error(str(row[0]) + " already exists in the index: " + stat)
        log.logger.info("the row is " + str(stat))
        row = cursor.fetchone()
The parseRow function
This function simply copies the database fields into a Job object and returns it. parseCompany, parsJobDes, and the other helpers below exist only because the raw data is irregular and needs tidying before it is returned, so I will not explain them in depth; in your own project, adapt them to whatever your data actually looks like.
def parseRow(row):
    # row[0] is the primary key, reused as the Elasticsearch document id
    job = Job(meta={'id': row[0]})
    job.url = row[7]
    job.job_name = row[1]
    job.company = parseCompany(row)
    job.job_description = parsJobDes(row[13])
    job.work_experience = row[9]
    job.welfare = parseTags(row[14])
    job.education = row[10]
    job.job_type = row[5]
    job.work_address = splitAddr(row[3])
    job.salary = parseSalary(row[4])
    job.date = row[8]
    return job
The remaining cleaning functions
def parseCompany(row):
    company = Company()
    company.company_name = row[2]
    # row[11] looks like "property|headcount|industry tags", but parts may be missing
    companyType = row[11].split("|")
    if len(companyType) >= 3:
        company.company_property = companyType[0] or "未知"  # "未知" = unknown
        company.number_person = companyType[1] or "未知"
        tags = companyType[2].split(" ")
        company.company_tag = tags
    elif len(companyType) >= 2:
        company.company_property = companyType[0] or "未知"
        tags = companyType[1].split(" ")
        company.company_tag = tags
    else:
        company.company_property = companyType[0] or "未知"
    return company
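To make the branching concrete, here is what a hypothetical raw value would produce; the sample company string is made up, but it follows the "property|headcount|tags" shape described above.

# hypothetical row: index 2 is the company name, index 11 the raw type string
row = [None, None, "某某科技有限公司", *[None] * 8, "民营公司|150-500人|计算机软件 互联网"]
company = parseCompany(row)
print(company.company_property)  # 民营公司
print(company.number_person)     # 150-500人
print(company.company_tag)       # ['计算机软件', '互联网']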
def splitAddr(param):
    # addresses come in as "city-district"; some rows only carry the city
    split = param.split("-")
    address = Address()
    address.city = split[0]
    if len(split) >= 2:
        address.area = split[1]
    return address
def parseSalary(param):
    # left as-is for now; the raw salary string is stored unparsed
    # split = param.split("-")
    return param
def parsJobDes(param):
    soup = BeautifulSoup(param, features="html.parser")
    root = soup.find("article")
    tags = []
    # guard against pages that have no <article> element at all
    if root is None:
        return tags
    for item in root:
        # drop the section headers ("职位信息" = job info, "岗位职责" = responsibilities,
        # "任职条件" = requirements) and nodes without a usable string
        if "职位信息" in str(item.string) \
                or "岗位职责" in str(item.string) \
                or "任职条件" in str(item.string) \
                or item.string is None:
            continue
        # skip strings that are nothing but whitespace
        if len(str(item.string).replace(" ", "").replace('\n', '').replace('\r', '')) == 0:
            continue
        tags.append(str(item.string))
    return tags
def parseTags(param):
    # "无" (none) means the posting listed no welfare tags at all
    if param == "无":
        return []
    soup = BeautifulSoup(param, features="html.parser")
    root = soup.find("div")
    tags = []
    # guard against fragments without a <div> wrapper
    if root is None:
        return tags
    for item in root:
        if item.string is None:
            continue
        # skip strings that are nothing but whitespace
        if len(str(item.string).replace(" ", "").replace('\n', '').replace('\r', '')) == 0:
            continue
        tags.append(str(item.string))
    return tags
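For a sense of what parseTags consumes, here is a hypothetical welfare-tag fragment of the kind the job site embeds, and the list it yields:

html = "<div><span>五险一金</span><span>年终奖金</span><span>带薪年假</span></div>"
print(parseTags(html))  # ['五险一金', '年终奖金', '带薪年假']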