This article builds on the POI data crawled in the earlier "爬虫POI" article and imports it from MySQL into Elasticsearch, so that it can be searched and queried more easily.
Install elasticsearch, elasticsearch-head, and kibana so the imported data can be inspected visually.
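Before going any further, it is worth confirming that Elasticsearch is actually reachable; the rest of this article assumes it is listening on 127.0.0.1:9200. A minimal sanity check with the Python client (not part of the original code, just a quick sketch):

from elasticsearch import Elasticsearch

es = Elasticsearch(['127.0.0.1:9200'])
print(es.ping())  # True if the cluster responds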
I. Defining the index
The ES index mapping is defined from the fields of the MySQL table:
CREATE TABLE `poi` (
`id` int(11) NOT NULL AUTO_INCREMENT COMMENT '主键ID',
`name` varchar(128) NOT NULL DEFAULT '' COMMENT '名称',
`province` varchar(64) NOT NULL DEFAULT '' COMMENT '省份',
`city` varchar(64) NOT NULL DEFAULT '' COMMENT '城市',
`district` varchar(64) NOT NULL DEFAULT '' COMMENT '区县',
`code` varchar(16) NOT NULL DEFAULT '' COMMENT '区号',
`phone_no` varchar(64) NOT NULL DEFAULT '' COMMENT '联系方式',
`region` varchar(64) NOT NULL DEFAULT '' COMMENT '所在区域',
`location` varchar(256) NOT NULL DEFAULT '' COMMENT '地址',
`category` varchar(128) NOT NULL DEFAULT '' COMMENT '大分类',
`sub_category` varchar(128) NOT NULL DEFAULT '' COMMENT '子分类',
`longitude` double NOT NULL DEFAULT '0' COMMENT '经度',
`latitude` double NOT NULL DEFAULT '0' COMMENT '纬度',
PRIMARY KEY (`id`),
KEY `idx_category` (`category`),
KEY `idx_name` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=833412 DEFAULT CHARSET=utf8 COMMENT='poi数据';
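The import code in section II reads this table through a Flask-SQLAlchemy model named Poi (e.g. Poi.query). The model itself is not shown in this article; a minimal sketch mirroring the table above, assuming db is the project's Flask-SQLAlchemy instance, might look like:

from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()  # in the real project this is created in the app factory

class Poi(db.Model):
    __tablename__ = 'poi'
    id = db.Column(db.Integer, primary_key=True, autoincrement=True)
    name = db.Column(db.String(128), nullable=False, default='')
    province = db.Column(db.String(64), nullable=False, default='')
    city = db.Column(db.String(64), nullable=False, default='')
    district = db.Column(db.String(64), nullable=False, default='')
    code = db.Column(db.String(16), nullable=False, default='')
    phone_no = db.Column(db.String(64), nullable=False, default='')
    region = db.Column(db.String(64), nullable=False, default='')
    location = db.Column(db.String(256), nullable=False, default='')
    category = db.Column(db.String(128), nullable=False, default='')
    sub_category = db.Column(db.String(128), nullable=False, default='')
    longitude = db.Column(db.Float, nullable=False, default=0)
    latitude = db.Column(db.Float, nullable=False, default=0)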
Index mapping definition:
_index_mappings = {
    "mappings": {
        type: {
            "properties": {
                "poi_id":     {"type": "long"},
                "name":       {"type": "text"},
                "province":   {"type": "text"},
                "city":       {"type": "text"},
                "district":   {"type": "text"},
                "code":       {"type": "text"},
                "phone_no":   {"type": "text"},
                "region":     {"type": "text"},
                "location":   {"type": "text"},
                "category":   {"type": "text"},
                "coordinate": {"type": "geo_point"}
            }
        }
    }
}
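The create_index command mentioned later (section III, problem 5) applies this mapping, but only its signature appears in this article. The following is just a sketch of what its body might look like with the Python client, assuming _index_mappings is the dict above and is built inside the command (it references the type parameter):

from elasticsearch import Elasticsearch

@taskManager.command
def create_index(name='location', type="poi"):
    es = Elasticsearch(['127.0.0.1:9200'])
    # _index_mappings is the mapping dict shown above, built here so that `type` is in scope
    if not es.indices.exists(index=name):
        res = es.indices.create(index=name, body=_index_mappings)
        print(res)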
II. Writing the data
Format the POI data according to the index fields defined above and write it to ES in bulk.
Note: because the POI data contains longitude/latitude information, a coordinate field of type geo_point is defined, which makes geo range searches possible later (see the query sketch at the end of this section). The geo_point value is written as {lat, lon}.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

@taskManager.command
def insert_poi(name='location', type="poi"):
    es = Elasticsearch(['127.0.0.1:9200'])
    # walk the table backwards from the largest id, 1000 rows per batch
    poi_id = 1000000
    while poi_id > 1:
        pois = Poi.query.filter(Poi.id < poi_id).order_by(Poi.id.desc()).limit(1000).all()
        if not pois:
            break  # no more rows, stop instead of looping forever
        actions = []
        for poi in pois:
            try:
                result = resolve(poi.name)
                poi_name = result[0] if result else ''
                action = {
                    "_index": name,
                    "_type": type,
                    "_source": {
                        "poi_id": str(poi.id),
                        "name": str(poi_name),
                        "province": str(poi.province),
                        "city": str(poi.city),
                        "district": str(poi.district),
                        "code": str(poi.code),
                        "phone_no": str(poi.phone_no),
                        "region": str(poi.region),
                        "location": str(poi.location),
                        "category": str(poi.category),
                        "coordinate": {"lat": str(poi.latitude), "lon": str(poi.longitude)}
                    }
                }
                actions.append(action)
            except Exception as ex:
                print(ex)
                continue
        # bulk-write the whole batch at once (outside the for loop, see problem 2 below)
        success, _ = bulk(es, actions, index=name, chunk_size=100, raise_on_error=True)
        poi_id = pois[-1].id
    print('poi insert done!')
import re
from urllib import parse

# handle special characters in name: URL-decode, then keep runs of Chinese characters followed by a Latin letter
def resolve(poiname):
    parsestr = parse.unquote(poiname)
    pattern = r'[\u4e00-\u9fa5]+[A-Za-z]'
    match = re.compile(pattern)
    return match.findall(parsestr)
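As noted above, the geo_point coordinate field is what makes geo searches possible. A minimal sketch of a geo_distance query that finds POIs within 1 km of a point (the coordinates below are placeholders, not values from the original article):

body = {
    "query": {
        "bool": {
            "filter": {
                "geo_distance": {
                    "distance": "1km",
                    "coordinate": {"lat": 39.915, "lon": 116.404}  # placeholder point
                }
            }
        }
    }
}
res = es.search(index='location', body=body)
print(res['hits']['total'])  # number of matching POIs (ES 6.x format)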
III. Problems encountered
Not being very familiar with Python, I ran into quite a few problems during the implementation. They are listed below:
1. MySQL connection problem
SQLALCHEMY_DATABASE_URI = 'mysql+pymysql://root:root@localhost:3306/scrapy?charset=utf8'
changed to
SQLALCHEMY_DATABASE_URI = 'mysql+mysqlconnector://root:root@localhost:3306/scrapy?charset=utf8'
# Reason: warnings like Warning: (1366, "Incorrect string value: '\\xD6\\...' for column ...
2. Duplicate data after bulk writes, i.e. one MySQL row ended up as many documents in ES. The exact cause was not clear at first; as a workaround, _id was set to the MySQL auto-increment id when building the bulk actions, so repeated writes overwrite the same document instead of creating new ones:
for poi in pois:
    action = {
        "_index": name,
        "_type": type,
        "_id": str(poi.id),
        "_source": {
It was later confirmed that the real cause was the position of the bulk() line: moving it out one indentation level, so it runs once per batch rather than inside the for loop, fixed the duplication.
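Schematically, the intended structure is the one used in section II: build all actions for the batch first, then call bulk() once. If bulk() sits inside the for loop, the growing actions list is presumably re-sent on every iteration, which would explain the duplicates:

if pois:
    actions = []
    for poi in pois:
        actions.append(action)  # one action per row
    # bulk() belongs here, at the same level as the for loop, not inside it
    success, _ = bulk(es, actions, index=name, chunk_size=100, raise_on_error=True)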
3. 'object is not subscriptable' error
This came from taking values out of the SQLAlchemy query results by subscript; accessing the mapped columns as object attributes fixes it, e.g.:
"name": str(poi.name).replace("-", "/"),
4. Index field type definitions
ES 6+ no longer has the string type; text is used for all string fields here:
"name": {
"type": "text",
},
5. Custom commands in Flask
This is done with flask-script's Manager: both creating the index and importing the data are wrapped as commands for convenience.
import os
from flask_script import Manager

config_name = os.environ.get('FLASK_CONFIG') or 'default'
# create the Flask app (create_app is the project's application factory)
app = create_app(config_name)
app.config['DEBUG'] = True
taskManager = Manager(app)

@taskManager.command
def create_index(name='location', type="poi"):
    ...  # creates the index with the mapping from section I

@taskManager.command
def insert_poi(name='location', type='poi'):
    ...  # the import function shown in section II

The commands are then invoked as:
python command.py create_index
python command.py insert_poi
IV. Full code
The complete code lives in the GitHub repository:
https://github.com/HelloMrShu/Python/blob/master/flask/command.py
References
[flask-script] https://flask-script.readthedocs.io/en/latest/
[Fixing the 1366 warning with Flask-SQLAlchemy] https://segmentfault.com/a/1190000010596306
[scrapy POI crawler walkthrough] https://blog.csdn.net/cjqh_hao/article/details/96469465
[elasticsearch + kibana data visualization] https://blog.csdn.net/cjqh_hao/article/details/86697533