Using Python to generate large volumes of data, write it to Elasticsearch, and query it (aggregations)

We simulate student grade records and write them to Elasticsearch. Each record contains a name, sex, subject, and grade.
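As a minimal sketch of the record layout described above, the generator below yields one bulk action per simulated record. The helper name `generate_actions` and its `rng` parameter are illustrative assumptions, not part of the original scripts, and `_type` is left out because mapping types no longer exist in Elasticsearch 8.x.

```python
import random

NAMES = ['刘一', '陈二', '张三', '李四', '王五', '赵六', '孙七', '周八', '吴九', '郑十']
SEXES = ['男', '女']
SUBJECTS = ['语文', '数学', '英语', '生物', '地理']
GRADES = [85, 77, 96, 74, 85, 69, 84, 59, 67, 69, 86, 96, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86]


def generate_actions(index, start, count, rng=random):
    """Yield one bulk action per simulated grade record (hypothetical helper)."""
    for i in range(start, start + count):
        yield {
            "_index": index,
            "_id": i,
            "_source": {
                "id": i,
                "name": rng.choice(NAMES),
                "sex": rng.choice(SEXES),
                "subject": rng.choice(SUBJECTS),
                "grade": rng.choice(GRADES),
            },
        }


# Show the shape of the first generated action
print(next(generate_actions("grade", 0, 1)))
```

Because `helpers.bulk` accepts any iterable of actions, a call such as `helpers.bulk(es, generate_actions("grade", 0, 10000))` should stream the records without first building a list in memory.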

Example 1:  [writes 10000*1000 documents in one run]  [took about 5100 seconds in my own test]

from elasticsearch import Elasticsearch
from elasticsearch import helpers
import random
import time

es = Elasticsearch(hosts='http://127.0.0.1:9200')
# print(es)

names = ['刘一', '陈二', '张三', '李四', '王五', '赵六', '孙七', '周八', '吴九', '郑十']
sexs = ['男', '女']
subjects = ['语文', '数学', '英语', '生物', '地理']
grades = [85, 77, 96, 74, 85, 69, 84, 59, 67, 69, 86, 96, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86]
datas = []

start = time.time()
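# Bulk-write the documents to Elasticsearch in batches of 10,000
for j in range(1000):
    print(j)  # progress: index of the current batch
    action = [
        {
            "_index": "grade",
            # Mapping types are deprecated in Elasticsearch 7.x and removed in 8.x;
            # drop "_type" when targeting a newer cluster.
            "_type": "doc",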
            "_id": i,
            "_source": {
                "id": i,
                "name": random.choice(names),
                "sex": random.choice(sexs),
                "subject": random.choice(subjects),
                "grade": random.choice(grades)
            }
        } for i in range(10000 * j, 10000 * j + 10000)
    ]
    helpers.bulk(es, action)
end = time.time()
print('Elapsed time (s):', end - start)

Result as displayed in elasticsearch-head (screenshot not preserved):

Example 2:  [writes 10000*5000 documents in one run]  [took about 23000 seconds in my own test]

from elasticsearch import Elasticsearch
from elasticsearch import helpers
import random
import time

es = Elasticsearch(hosts='http://127.0.0.1:9200')
# print(es)

names = ['刘一', '陈二', '张三', '李四', '王五', '赵六', '孙七', '周八', '吴九', '郑十']
sexs = ['男', '女']
subjects = ['语文', '数学', '英语', '生物', '地理']
grades = [85, 77, 96, 74, 85, 69, 84, 59, 67, 69, 86, 96, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86]
datas = []

start = time.time()
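# Bulk-write the documents to Elasticsearch in batches of 10,000
for j in range(5000):
    print(j)  # progress: index of the current batch
    action = [
        {
            "_index": "grade3",
            # Mapping types are deprecated in Elasticsearch 7.x and removed in 8.x;
            # drop "_type" when targeting a newer cluster.
            "_type": "doc",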
            "_id": i,
            "_source": {
                "id": i,
                "name": random.choice(names),
                "sex": random.choice(sexs),
                "subject": random.choice(subjects),
                "grade": random.choice(grades)
            }
        } for i in range(10000 * j, 10000 * j + 10000)
    ]
    helpers.bulk(es, action)
end = time.time()
print('Elapsed time (s):', end - start)

Example 3:  [writes 10000*9205 documents in one run]  [took too long to time]

from elasticsearch import Elasticsearch
from elasticsearch import helpers
import random
import time

es = Elasticsearch(hosts='http://127.0.0.1:9200')

names = ['刘一', '陈二', '张三', '李四', '王五', '赵六', '孙七', '周八', '吴九', '郑十']
sexs = ['男', '女']
subjects = ['语文', '数学', '英语', '生物', '地理']
grades = [85, 77, 96, 74, 85, 69, 84, 59, 67, 69, 86, 96, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86]
datas = []

start = time.time()
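# Bulk-write the documents to Elasticsearch in batches of 10,000
for j in range(9205):
    print(j)  # progress: index of the current batch
    action = [
        {
            "_index": "grade2",
            # Mapping types are deprecated in Elasticsearch 7.x and removed in 8.x;
            # drop "_type" when targeting a newer cluster.
            "_type": "doc",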
            "_id": i,
            "_source": {
                "id": i,
                "name": random.choice(names),
                "sex": random.choice(sexs),
                "subject": random.choice(subjects),
                "grade": random.choice(grades)
            }
        } for i in range(10000*j, 10000*j+10000)
    ]
    helpers.bulk(es, action)
end = time.time()
print('Elapsed time (s):', end - start)

Query the data and compute grade totals in several ways.
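The three totals computed client-side in this section (overall, per student, and per student per subject) can be sketched compactly with `collections.defaultdict`. The function name `tally_grades` and the sample records are assumptions made for illustration only.

```python
from collections import defaultdict


def tally_grades(records):
    """Return (overall total, per-student totals, per-student-per-subject totals)."""
    overall = 0
    per_student = defaultdict(int)
    per_student_subject = defaultdict(lambda: defaultdict(int))
    for rec in records:
        grade = int(rec["grade"])
        overall += grade
        per_student[rec["name"]] += grade
        per_student_subject[rec["name"]][rec["subject"]] += grade
    # Convert to plain dicts so the result prints and compares cleanly
    nested = {name: dict(subs) for name, subs in per_student_subject.items()}
    return overall, dict(per_student), nested


# Hand-written sample records in the same shape as the indexed "_source"
sample = [
    {"name": "张三", "subject": "语文", "grade": 85},
    {"name": "张三", "subject": "数学", "grade": 77},
    {"name": "李四", "subject": "语文", "grade": 96},
    {"name": "张三", "subject": "语文", "grade": 69},
]
overall, per_student, per_student_subject = tally_grades(sample)
print(overall)              # 327
print(per_student)          # {'张三': 231, '李四': 96}
print(per_student_subject)  # {'张三': {'语文': 154, '数学': 77}, '李四': {'语文': 96}}
```

A single pass over the records fills all three tallies, so the data is only iterated once instead of three times.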

Example 4:  [fetches all the data in one request, then times each calculation in the program]

from elasticsearch import Elasticsearch
import time


def search_data(es, size=10):
    query = {
        "query": {
            "match_all": {}
        }
    }
    res = es.search(index='grade', body=query, size=size)
    # print(res)
    return res


if __name__ == '__main__':
    start = time.time()
    es = Elasticsearch(hosts='http://192.168.1.1:9200')
    # print(es)
    size = 10000
    res = search_data(es, size)
    # print(type(res))
    # total = res['hits']['total']['value']
    # print(total)
    # Iterate over the hits actually returned, which may be fewer than size
    all_source = [hit['_source'] for hit in res['hits']['hits']]

    # Sum of all grades across all students and all subjects
    start1 = time.time()
    all_grade = 0
    for data in all_source:
        all_grade += int(data['grade'])
    print('Sum of all grades:', all_grade)
    end1 = time.time()
    print('Elapsed:', end1 - start1)

    # Total grade per student across all subjects
    start2 = time.time()
    all_name_grade = {}
    for data in all_source:
        if data['name'] in all_name_grade:
            all_name_grade[data['name']] += data['grade']
        else:
            all_name_grade[data['name']] = data['grade']
    print(all_name_grade)
    end2 = time.time()
    print('Elapsed:', end2 - start2)

    # Total grade per student per subject
    start3 = time.time()
    all_name_all_subject_grade = {}
    for data in all_source:
        # setdefault creates the inner dict the first time a name is seen
        by_subject = all_name_all_subject_grade.setdefault(data['name'], {})
        # Test key membership rather than truthiness so a running total of 0 is not reset
        if data['subject'] in by_subject:
            by_subject[data['subject']] += data['grade']
        else:
            by_subject[data['subject']] = data['grade']
    print(all_name_all_subject_grade)
    end3 = time.time()
    print('Elapsed:', end3 - start3)
    end = time.time()
    print('Total elapsed:', end - start)

Output (screenshot not preserved):

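When size in Example 4 is changed from 10000 to 2000000 (which requires raising the index's index.max_result_window setting above its default of 10000), the result is as follows (screenshot not preserved):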

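        Real projects generally avoid the fetch-and-compute approach of Example 4 because it is slow on large data sets; Elasticsearch aggregation queries should be used instead. The next example computes the sum of all grades in the data.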
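For the overall total, one option is Elasticsearch's `sum` metric aggregation, which returns the total directly instead of one bucket per distinct grade. This is a sketch rather than the article's own method; the helper names `build_sum_query` and `extract_sum`, the aggregation name `total_grade`, and the stand-in response are assumptions for illustration.

```python
def build_sum_query(field="grade"):
    """Build a request body asking Elasticsearch for the sum of one numeric field."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {"total_grade": {"sum": {"field": field}}},
    }


def extract_sum(response):
    """Pull the summed value out of a search response dictionary."""
    return response["aggregations"]["total_grade"]["value"]


# Hand-written stand-in for a real response, used only to show the shape
fake_response = {"aggregations": {"total_grade": {"value": 327.0}}}
print(build_sum_query())
print(extract_sum(fake_response))  # 327.0
```

With a live client this would be called in the same style as the other examples, for instance `extract_sum(es.search(index='grade', body=build_sum_query()))`.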

Example 5:  [compares the plain computation with the aggregation approach]

from elasticsearch import Elasticsearch
import time


def search_data(es, size=10):
    query = {
        "query": {
            "match_all": {}
        }
    }
    res = es.search(index='grade', body=query, size=size)
    # print(res)
    return res


def search_data2(es, size=10):
    query = {
        "aggs": {
            "all_grade": {
                "terms": {
                    "field": "grade",
                    "size": 1000
                }
            }
        }
    }
    res = es.search(index='grade', body=query, size=size)
    # print(res)
    return res


if __name__ == '__main__':
    start = time.time()
    es = Elasticsearch(hosts='http://127.0.0.1:9200')
    # A size above 10,000 requires raising index.max_result_window on the index
    size = 2000000
    res = search_data(es, size)
    # Iterate over the hits actually returned, which may be fewer than size
    all_source = [hit['_source'] for hit in res['hits']['hits']]

    # Sum of all grades across all students and all subjects, computed client-side
    start1 = time.time()
    all_grade = 0
    for data in all_source:
        all_grade += int(data['grade'])
    print('2 million documents, sum of all grades:', all_grade)
    end1 = time.time()
    print('Elapsed:', end1 - start1)

    end = time.time()
    print('2 million documents, total elapsed:', end - start)

    # Aggregation: reuse the client and request no hits, only the buckets
    start_aggs = time.time()
    size = 0
    res = search_data2(es, size)
    # print(res)

    aggs = res['aggregations']['all_grade']['buckets']
    print(aggs)

    # Each bucket is one distinct grade value; weight it by its document count
    total = 0
    for agg in aggs:
        total += agg['key'] * agg['doc_count']

    print('10 million documents, sum of all grades:', total)
    end_aggs = time.time()
    print('10 million documents, total elapsed:', end_aggs - start_aggs)

Output (screenshot not preserved):

Compute each student's total grade across all subjects.

Example 6:  [sub-aggregation]  [group first, then compute]

from elasticsearch import Elasticsearch
import time


def search_data(es, size=10):
    query = {
        "query": {
            "match_all": {}
        }
    }
    res = es.search(index='grade', body=query, size=size)
    # print(res)
    return res


def search_data2(es):
    query = {
        "size": 0,
        "aggs": {
            "all_names": {
                "terms": {
                    "field": "name.keyword",
                    "size": 10
                },
                "aggs": {
                    "total_grade": {
                        "sum": {
                            "field": "grade"
                        }
                    }
                }
            }
        }
    }
    res = es.search(index='grade', body=query)
    # print(res)
    return res


if __name__ == '__main__':
    start = time.time()
    es = Elasticsearch(hosts='http://127.0.0.1:9200')
    # A size above 10,000 requires raising index.max_result_window on the index
    size = 2000000
    res = search_data(es, size)
    # Iterate over the hits actually returned, which may be fewer than size
    all_source = [hit['_source'] for hit in res['hits']['hits']]

    # Total grade per student across all subjects, computed client-side
    start2 = time.time()
    all_name_grade = {}
    for data in all_source:
        if data['name'] in all_name_grade:
            all_name_grade[data['name']] += data['grade']
        else:
            all_name_grade[data['name']] = data['grade']
    print(all_name_grade)
    end2 = time.time()
    print('2 million documents, elapsed:', end2 - start2)

    end = time.time()
    print('2 million documents, total elapsed:', end - start)

    # Aggregation: reuse the client; the per-name sums are computed by Elasticsearch
    start_aggs = time.time()
    res = search_data2(es)
    # print(res)

    aggs = res['aggregations']['all_names']['buckets']
    # print(aggs)
    dic = {}
    for agg in aggs:
        dic[agg['key']] = agg['total_grade']['value']

    print('10 million documents:', dic)
    end_aggs = time.time()
    print('10 million documents, total elapsed:', end_aggs - start_aggs)

Output (screenshot not preserved):

Compute each student's total grade for each subject.

Example 7:

from elasticsearch import Elasticsearch
import time


def search_data(es, size=10):
    query = {
        "query": {
            "match_all": {}
        }
    }
    res = es.search(index='grade', body=query, size=size)
    # print(res)
    return res


def search_data2(es):
    query = {
        "size": 0,
        "aggs": {
            "all_names": {
                "terms": {
                    "field": "name.keyword",
                    "size": 10
                },
                "aggs": {
                    "all_subjects": {
                        "terms": {
                            "field": "subject.keyword",
                            "size": 5
                        },
                        "aggs": {
                            "total_grade": {
                                "sum": {
                                    "field": "grade"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    res = es.search(index='grade', body=query)
    # print(res)
    return res


if __name__ == '__main__':
    start = time.time()
    es = Elasticsearch(hosts='http://127.0.0.1:9200')
    # A size above 10,000 requires raising index.max_result_window on the index
    size = 2000000
    res = search_data(es, size)
    # Iterate over the hits actually returned, which may be fewer than size
    all_source = [hit['_source'] for hit in res['hits']['hits']]

    # Total grade per student per subject, computed client-side
    start3 = time.time()
    all_name_all_subject_grade = {}
    for data in all_source:
        # setdefault creates the inner dict the first time a name is seen
        by_subject = all_name_all_subject_grade.setdefault(data['name'], {})
        # Test key membership rather than truthiness so a running total of 0 is not reset
        if data['subject'] in by_subject:
            by_subject[data['subject']] += data['grade']
        else:
            by_subject[data['subject']] = data['grade']
    print('2 million documents:', all_name_all_subject_grade)
    end3 = time.time()
    print('Elapsed:', end3 - start3)
    end = time.time()
    print('2 million documents, total elapsed:', end - start)

    # Aggregation: reuse the client; names are grouped first, then subjects within each name
    start_aggs = time.time()
    res = search_data2(es)
    # print(res)

    aggs = res['aggregations']['all_names']['buckets']
    # print(aggs)

    dic = {}
    for agg in aggs:
        dic[agg['key']] = {}
        for sub in agg['all_subjects']['buckets']:
            dic[agg['key']][sub['key']] = sub['total_grade']['value']
    print('10 million documents:', dic)
    end_aggs = time.time()
    print('10 million documents, total elapsed:', end_aggs - start_aggs)

Output (screenshot not preserved):

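        In the query examples above, which use the grade index with 10 million documents, the plain fetch-and-compute approach is slow, and aggregation queries save a great deal of time. For the grade2 index with 92.05 million documents, the plain approach takes far too long to be usable in a production environment, so aggregations must be used.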

Example 8:

from elasticsearch import Elasticsearch
import time


def search_data(es):
    query = {
        "size": 0,
        "aggs": {
            "all_names": {
                "terms": {
                    "field": "name.keyword",
                    "size": 10
                },
                "aggs": {
                    "all_subjects": {
                        "terms": {
                            "field": "subject.keyword",
                            "size": 5
                        },
                        "aggs": {
                            "total_grade": {
                                "sum": {
                                    "field": "grade"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    res = es.search(index='grade2', body=query)
    # print(res)
    return res


if __name__ == '__main__':
    # Aggregation over the 92.05-million-document index
    start_aggs = time.time()
    es = Elasticsearch(hosts='http://127.0.0.1:9200')
    res = search_data(es)
    # print(res)

    aggs = res['aggregations']['all_names']['buckets']
    # print(aggs)

    # Flatten the nested buckets into {name: {subject: total}}
    dic = {}
    for agg in aggs:
        dic[agg['key']] = {}
        for sub in agg['all_subjects']['buckets']:
            dic[agg['key']][sub['key']] = sub['total_grade']['value']
    print('92.05 million documents:', dic)
    end_aggs = time.time()
    print('92.05 million documents, total elapsed:', end_aggs - start_aggs)

Output (screenshot not preserved):

Note: it is best to write queries in Kibana first, since Kibana offers suggestions as you type, and then copy the finished query into your code.
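Going the other way can also be handy: a query that already lives in Python can be printed in the form the Kibana Dev Tools console accepts, then pasted there for checking. The helper name `to_console_request` is a hypothetical one chosen for this sketch.

```python
import json


def to_console_request(index, query):
    """Format a query dict as a Kibana Dev Tools console request (hypothetical helper)."""
    # ensure_ascii=False keeps any Chinese field values readable in the output
    body = json.dumps(query, indent=2, ensure_ascii=False)
    return f"GET {index}/_search\n{body}"


query = {
    "size": 0,
    "aggs": {
        "all_names": {
            "terms": {"field": "name.keyword", "size": 10},
            "aggs": {"total_grade": {"sum": {"field": "grade"}}},
        }
    },
}
print(to_console_request("grade", query))
```

The printed text starts with the request line `GET grade/_search`, followed by the JSON body.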
