Building a Token Index for Fuzzy Queries

Latest recommended article published on 2024-07-25 16:30:40

Aa4a


Tags: python

Article link: https://blog.csdn.net/2302_79067143/article/details/136597628


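This article shows how to generate a large amount of random data with Python's pandas library: names and dates are produced with Faker, each name is split into words and every word combination is recorded in a token index, the token index is saved to a CSV file, and the original data is written to a MySQL database.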
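To make the tokenization step concrete, here is a minimal sketch of what it produces for a single name. The helper name `name_tokens` and the sample names are hypothetical and only serve as illustration; the logic mirrors the `combinations` loop used in the script.

```python
from itertools import combinations


def name_tokens(name):
    """Return every order-preserving combination of the words in `name`.

    `name_tokens` is a hypothetical helper used only for illustration.
    """
    words = name.split()
    return [
        " ".join(combo)
        for size in range(1, len(words) + 1)
        for combo in combinations(words, size)
    ]


print(name_tokens("John Smith"))
# ['John', 'Smith', 'John Smith']
print(name_tokens("Mary Ann Lee"))
# ['Mary', 'Ann', 'Lee', 'Mary Ann', 'Mary Lee', 'Ann Lee', 'Mary Ann Lee']
```

A name with n words yields 2^n - 1 keys, so the index grows quickly for long names; for typical two- or three-word names it stays small.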

摘要由CSDN通过智能技术生成

import pandas as pd
from faker import Faker
from sqlalchemy import create_engine
from itertools import combinations


# First, generate the data
# Create a Faker object
fake = Faker()



# Generate 5,000,000 rows of random data
data = {
    'id': range(1, 5000001),
    'name': [fake.name() for _ in range(5000000)],
    'datetime': [fake.date_time_this_decade() for _ in range(5000000)]
}



# Create the DataFrame
df = pd.DataFrame(data)


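# Tokenize: map every word combination of a name to the ids of the rows containing it
token_index = {}

# Iterate over the two columns directly; this is much faster than df.iterrows() on 5M rows
for row_id, name in zip(df['id'], df['name']):
    words = name.split()

    # Every order-preserving combination of 1..len(words) words becomes a key
    for i in range(1, len(words) + 1):
        for combo in combinations(words, i):
            key = ' '.join(combo)
            if key not in token_index:
                token_index[key] = []
            token_index[key].append(row_id)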



df2 = pd.DataFrame(list(token_index.items()), columns=['keywords', 'word_index'])

# Save the token index
df2.to_csv('./word.csv')



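# Connect to the MySQL database
# (a bare mysql:// URL uses SQLAlchemy's default MySQL driver, mysqlclient, which must be installed)
engine = create_engine('mysql://root:123456@localhost/test')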



# Insert the data into the MySQL database
df.to_sql('user', con=engine, if_exists='append', index=False)
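The script builds the index but does not show how it would be queried. The following is a minimal sketch of a lookup, assuming the index is held in memory as a dictionary. The function names `build_token_index` and `lookup` and the three sample rows are hypothetical; the index construction mirrors the loop above.

```python
from collections import defaultdict
from itertools import combinations


def build_token_index(rows):
    """Map each word combination of a name to the ids of the rows containing it.

    `rows` is an iterable of (id, name) pairs; hypothetical helper for illustration.
    """
    index = defaultdict(list)
    for row_id, name in rows:
        words = name.split()
        for size in range(1, len(words) + 1):
            for combo in combinations(words, size):
                index[" ".join(combo)].append(row_id)
    return index


def lookup(index, query):
    """Return the ids whose name contains the query words in the same order."""
    key = " ".join(query.split())  # normalize whitespace to match the index keys
    return index.get(key, [])


# Hypothetical sample rows standing in for the Faker-generated data
rows = [(1, "John Smith"), (2, "John Michael Doe"), (3, "Jane Smith")]
index = build_token_index(rows)

print(lookup(index, "John"))        # [1, 2]
print(lookup(index, "John Doe"))    # [2]
print(lookup(index, "Smith John"))  # []
```

Because the keys are whole words joined in their original order, this kind of lookup is case-sensitive and matches complete words only: "John Doe" finds "John Michael Doe", but "Joh" or "Smith John" finds nothing.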