Tokenizing CSV File Contents with Python and Elasticsearch

鬼义II虎神

Category: Python进阶学习笔记 (Advanced Python Study Notes). Tags: Python, ElasticSearch, CSV

Article link: https://blog.csdn.net/u013487601/article/details/103262667

An interviewer set me a coding exercise: use Python to have Elasticsearch tokenize the contents of a CSV file.

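1. Environment setup (`Windows`)

1.1 Installing `Python`

Omitted.

1.2 Installing `Elasticsearch` (full-text search engine) and `Kibana` (management tool)

https://www.elastic.co/cn/downloads/

1.3 Installing the `IK` analysis plugin

https://github.com/medcl/elasticsearch-analysis-ik/releases
Note: the plugin version must match the ES version. Unzip it into its own subfolder of the plugins directory under the ES installation directory (for example plugins/ik), then restart ES.
For details, see:
https://www.cnblogs.com/zmc940317/articles/10646222.html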
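
Once the plugin is in place, one quick way to see what IK does with a piece of text is the ES `_analyze` API. The following is a minimal sketch using only the standard library; the function names (`build_analyze_request`, `extract_tokens`, `analyze`) and the default host are my own assumptions for illustration, not part of the original script.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="ik_max_word", host="http://localhost:9200"):
    """Build (but do not send) a POST request for the ES _analyze API."""
    payload = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/_analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(analyze_response):
    """Pull the token strings out of a parsed _analyze response body."""
    return [entry["token"] for entry in analyze_response.get("tokens", [])]


def analyze(text, analyzer="ik_max_word"):
    """Send the request; needs a running ES node with the IK plugin loaded."""
    request = build_analyze_request(text, analyzer)
    with urllib.request.urlopen(request, timeout=5) as response:
        return extract_tokens(json.load(response))
```

With ES running, `print(analyze("中科院计算技术研究所"))` and `print(analyze("中科院计算技术研究所", "ik_smart"))` should show the difference between the fine-grained and coarse-grained analyzers. If ES answers with an error saying the analyzer cannot be found, the plugin was probably not loaded.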

1.4 Starting `ES`

In the bin directory of the installation, double-click elasticsearch.bat, then open http://localhost:9200 in a browser.
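
The same check can be done from Python instead of a browser. This is a rough sketch with the standard library only; `describe_node` and `check_node` are hypothetical helper names. The version number it reports is the one the IK plugin release has to match.

```python
import json
import urllib.error
import urllib.request


def describe_node(root_info):
    """Summarise the JSON document that ES returns at its root URL."""
    version = root_info.get("version", {}).get("number", "unknown")
    return f"{root_info.get('cluster_name', 'unknown')} (version {version})"


def check_node(url="http://localhost:9200", timeout=3):
    """Return a short description of the node, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return describe_node(json.load(response))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON
        return None
```

Calling `print(check_node())` prints `None` while ES is still starting up, so it can be rerun until a description appears.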

1.5 Starting `Kibana`

In the bin directory of the installation, double-click kibana.bat, then open http://localhost:5601 in a browser.

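2. Implementation

There are quite a few functions, but each one is simple when read on its own, so take your time.
If you have no ES background, watch this video first: https://www.bilibili.com/video/av44221410. It explains things clearly.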


# -*- coding:UTF-8 -*-
"""
Program: working with CSV data through Elasticsearch
Version: 1.0
Author: 鬼义虎神
Date: 2019-11-14
Reference videos and documents:
    1.https://www.cnblogs.com/shaosks/p/7592229.html ES operations on CSV
    2.https://www.cnblogs.com/fighter007/p/9351688.html Reading CSV
    3.https://blog.csdn.net/chenftli/article/details/83000859 CSV handling in Python
    4.https://www.bilibili.com/video/av44221410 ES from installation to query syntax to Python usage
"""
# Standard-library CSV module
import csv
# ES client library (install first with: pip install elasticsearch)
from elasticsearch import Elasticsearch


# Create the index
def create_index():
    _index_mappings = {
        "mappings": {
            "dynamic": "true",  # fields not listed below are still accepted and mapped dynamically
            "properties": {
                "ExpertID": {
                    "type": "long"    # integer values
                },
                "fieldID": {
                    "type": "long"
                },
                "姓名": {
                    "type": "text"  # string, default analyzer
                },
                "单位": {
                    "type": "text",
                    "analyzer": "ik_max_word"  # IK analyzer, finer-grained tokenization
                },
                "学历": {
                    "type": "text",
                    "analyzer": "ik_max_word"  
                },
                "学位": {
                    "type": "text",
                    "analyzer": "ik_max_word"  
                },
                "职务": {
                    "type": "text",
                    "analyzer": "ik_max_word"  
                },
                "职称": {
                    "type": "text",
                    "analyzer": "ik_max_word"  
                },
                "学科": {
                    "type": "text",
                    "analyzer": "ik_max_word"  
                },
                "工作电话": {
                    "type": "text"
                },

                "地址": {
                    "type": "text",
                    "analyzer": "ik_smart"  # IK analyzer, coarser-grained tokenization
                },
                "研究方向": {
                    "type": "text",  
                    "analyzer": "ik_max_word"  
                },

                "所属部门": {
                    "type": "text",
                    "analyzer": "ik_max_word"
                },
                "身份": {
                    "type": "text",
                    "analyzer": "ik_max_word"
                },
                "邮箱": {
                    "type": "text"
                },
                "领域": {
                    "type": "text",
                    "analyzer": "ik_max_word"
                },
                "项目": {
                    "type": "text",
                    "analyzer": "ik_smart"  # coarse-grained tokenization
                }
            }
        }
    }
    # Create the index only if it does not exist yet
    if not es.indices.exists(index=my_index):
        # Create the index, passing its name and mappings
        res = es.indices.create(index=my_index, body=_index_mappings)
        print("Index not found, creating it:\n", res)
    else:
        print("Index already exists, skipping creation!")


# Delete the index
def delete_index():
    # Delete the index only if it exists
    if es.indices.exists(index=my_index):
        res = es.indices.delete(index=my_index)
        print(res)


# Read rows from the CSV file and store them in ES
def index_data_from_csv(csvfile):
    """
    :param csvfile: CSV file name, including the full path
    """
    # Read the CSV data
    with open(csvfile, 'r', encoding='utf-8', newline='') as f:
        list_data = csv.reader(f)
        # Turn each row into a dict and index it
        index = 0
        for item in list_data:
            if index > 0:  # row 0 is the header, so skip only that row
                doc = {}  # fresh dict for every row
                doc['ExpertID'] = item[0]
                doc['fieldID'] = item[1]
                doc['姓名'] = item[2]
                doc['单位'] = item[3]
                doc['学历'] = item[4]
                doc['学位'] = item[5]
                doc['职务'] = item[6]
                doc['职称'] = item[7]
                doc['学科'] = item[8]
                doc['工作电话'] = item[9]
                doc['地址'] = item[10]
                doc['研究方向'] = item[11]
                doc['所属部门'] = item[12]
                doc['身份'] = item[13]
                doc['邮箱'] = item[14]
                doc['领域'] = item[15]
                doc['项目'] = item[16]
                # Index the document, passing index name, type name and body
                res = es.index(index=my_index, doc_type=my_doc_type, body=doc)
                print(res)
            # Not needed for indexing; just shows how many rows have been read
            index += 1
            print(index)


# Query all documents in the index
def query_all_index_data():
    query_all_body = {
        "query": {
            "match_all": {}
        },
        "from": 0,
        "size": 100
    }
    all_data = es.search(index=my_index, body=query_all_body)
    print(all_data)


# Fetch a document by ID
def get_data_id(id):
    res = es.get(index=my_index, doc_type=my_doc_type, id=id)
    print(res)


# Query with a request body; any ES query syntax can be used here
def get_data_by_body():
    query_body = {
        "query": {
            "match": {
                "地址": "大学"
            }
        }
    }

    res = es.search(index=my_index, body=query_body)
    print(res)


# Insert documents
def insert_index_data():
    list_data = [
        {'ExpertID': '10099',
         'fieldID': '8',
         '姓名': '李凡',
         '单位': '中科院计算技术研究所',
         '学历': '博士后',
         '学位': '博士',
         '职务': '博导',
         '职称': '总监',
         '学科': 'Python',
         '工作电话': 'nan',
         '地址': 'nan',
         '研究方向': 'ES',
         '所属部门': 'Python开发',
         '身份': '中科院计算技术研究所主研究员',
         '邮箱': 'lifan775269525@foxmail.com',
         '领域': '航天',
         '项目': '青年基金、中国科学院知识创新工程子课题、国家242信息安全计划国家863计划'},
        {'ExpertID': '10100',
         'fieldID': '9',
         '姓名': '李凡1',
         '单位': '111中科院计算技术研究所',
         '学历': '博士11后',
         '学位': '博11士',
         '职务': '1111',
         '职称': '总111监',
         '学科': 'Pyt1111hon',
         '工作电话': 'nan',
         '地址': 'nan',
         '研究方向': 'E11S',
         '所属部门': 'Py11111thon开发',
         '身份': '中科院1111计算技术研究所主研究员',
         '邮箱': 'lifa111n775269525@foxmail.com',
         '领域': '11航天',
         '项目': '青年1111基金、中国科学院知识创新工程子课题、国家242信息安全计划国家863计划'},
    ]
    # Same as the CSV import, except that the dicts are typed by hand instead of built from CSV rows
    for item in list_data:
        res = es.index(index=my_index, doc_type=my_doc_type, body=item)
        print(res)


# Delete one document from the index
def delete_index_data_by_id(id):
    '''
    Delete one document from the index
    :param id: ID of the document to delete
    '''
    res = es.delete(index=my_index, doc_type=my_doc_type, id=id)
    print(res)


# Delete the documents matched by a request-body query
def delete_index_data_by_body():
    query_body = {
        "query": {
            "match": {
                "地址": "大学"
            }
        }
    }
    res = es.delete_by_query(index=my_index, body=query_body)
    print(res)


if __name__ == '__main__':
    # 0. Index con_pro, type _doc
    my_index = 'con_pro'
    my_doc_type = '_doc'

    # 1. Connect Python to ES
    es = Elasticsearch(['127.0.0.1:9200'])

    # 2. Create the index
    create_index()

    # 3. Import the CSV data
    # index_data_from_csv('./con_experts_pro_3.csv')

    # 4. Insert documents (create)
    # insert_index_data()

    # Query the first 100 documents (read)
    # query_all_index_data()

    # Fetch by ID (read)
    # get_data_id("ox2aaG4Bbd3iU8yReS")

    # Query with a request body (read); the body can express any kind of query
    # get_data_by_body()

    # Delete by ID (delete)
    # delete_index_data_by_id('ox2aaG4Bbd3iU8yReS-J')

    # Delete by query result (delete)
    # delete_index_data_by_body()

    # Delete the index
    delete_index()
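
The query functions above print the raw response, which is hard to read once there are more than a few hits. As a rough sketch, the two helpers below build a `match` request body and flatten the `hits.hits` list of a search response into tuples. Both names (`build_match_query`, `summarize_hits`) and the default field selection are assumptions of mine, not part of the original script.

```python
def build_match_query(field, text, start=0, size=10):
    """Build a request body for a paged full-text match query."""
    return {"query": {"match": {field: text}}, "from": start, "size": size}


def summarize_hits(response, fields=("姓名", "单位")):
    """Return one (id, score, *field values) tuple per hit in a search response."""
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        values = [source.get(name) for name in fields]
        rows.append((hit.get("_id"), hit.get("_score"), *values))
    return rows
```

Inside the script this could be used as `summarize_hits(es.search(index=my_index, body=build_match_query("地址", "大学")))`. Hits come back ordered by `_score`, so the first tuples are the best matches for the analyzed query text.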

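When the interviewer mentioned ElasticSearch I assumed it would be easy. My earlier projects used ES integrated with Django, where following the setup steps was enough, so I had hardly ever worked with native ES indices and queries.
Back home that day I panicked. I could not take in any of the posts I found online and felt extremely frustrated. That evening I started an ES video tutorial from scratch and stayed up until 3:30 a.m. to finish it, then got up early the next morning to keep going. I was very happy with the final result.
I hope I can stay in this job for the long run.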

Save the CSV contents below to a file yourself (only a few rows are pasted here, to avoid leaking data).

ExpertID,fieldID,姓名,单位,学历,学位,职务,职称,学科,工作电话,地址,研究方向,所属部门,身份,邮箱,领域,项目
10000,6,陈啊霖,中科院计算所,博士研究生,博士,博导,研究员,计算机应用技术,nan,nan,多模式人机交互，多媒体技术，图像理解，模式识别 ,智能信息重点实验室,中科院计算所研究员,xlchen[at]ict.ac.cn,计算机视觉,多媒体技术以及多模式人机接口。先后主持过自然科学基金重点项目
10001,8,蒋啊强,中科院计算技术研究所,博士研究生,博士,博导,研究员,计算机应用技术,nan,nan,图像/视频等多媒体信息的分析、理解、监控、检索等技术的研究开发 ,智能信息处理重点实验室,中科院计算技术研究所副研究员,sqijang[at]ict.ac.cn,多媒体技术,国家自然科学基金优秀青年科学基金、国家自然科学基金面上项目、国家自然科学基金重点项目子课题、国家自然科学基金青年基金、中国科学院知识创新工程子课题、国家242信息安全计划国家863计划
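
Because the file has a header row, `csv.DictReader` can map column names to values without positional indexes, and the resulting documents can be sent in one request with the bulk helper of the `elasticsearch` package. The sketch below rests on a few assumptions of mine: the names `rows_to_actions` and `bulk_index` are hypothetical, the two ID columns are cast to `int`, and cells that are empty or hold the placeholder `nan` are dropped rather than indexed.

```python
import csv
import io

INT_FIELDS = ("ExpertID", "fieldID")  # columns mapped as "long" in the index


def rows_to_actions(csv_stream, index_name):
    """Yield one bulk action per CSV data row; the header row supplies the keys."""
    for row in csv.DictReader(csv_stream):
        doc = {}
        for key, value in row.items():
            if key is None or value is None:
                continue  # ragged row: extra or missing cells
            value = value.strip()
            if value in ("", "nan"):
                continue  # assumption: treat these placeholders as missing
            doc[key] = int(value) if key in INT_FIELDS else value
        yield {"_index": index_name, "_source": doc}


def bulk_index(es, csv_path, index_name):
    """Index a whole CSV file in one bulk call; needs the elasticsearch package."""
    from elasticsearch import helpers  # imported lazily so the parser works without it
    with open(csv_path, "r", encoding="utf-8", newline="") as f:
        return helpers.bulk(es, rows_to_actions(f, index_name))


# Small in-memory demo that does not touch ES
SAMPLE = "ExpertID,fieldID,姓名,工作电话\n10000,6,陈啊霖,nan\n"
actions = list(rows_to_actions(io.StringIO(SAMPLE), "con_pro"))
```

`helpers.bulk` returns a tuple of the number of successfully indexed documents and a list of errors, so `bulk_index(es, './con_experts_pro_3.csv', 'con_pro')` gives a quick count to compare against the number of rows in the file.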
