Tokenization with NLTK

Rhichard_CHAN

Article link: https://blog.csdn.net/qq_40620534/article/details/108976551

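Preface

This post records the main errors I ran into while writing an NLTK tokenization project in Python, and how I fixed them.
The project uses NLTK to read text from a database, remove stop words, and write the result back to the database.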

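I. What is NLTK?

The Natural Language Toolkit (NLTK) is one of the most widely used Python libraries in the NLP field.
NLTK is an open-source project comprising Python modules, datasets, and tutorials for NLP research and development [1].
It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.
NLTK includes graphical demonstrations and sample data, and its tutorials explain the basic concepts behind the language-processing tasks the toolkit supports.

In this post it is mainly used to remove stop words from text.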
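
As a minimal sketch of what stop-word removal means, the snippet below filters a sentence against a small hand-written stop-word set, using a simple regular-expression tokenizer. Both are stand-ins chosen so that the example runs without downloading any NLTK data; they are assumptions for illustration, not what NLTK uses internally. With NLTK installed, word_tokenize and set(stopwords.words('english')) could be passed in their place.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only.
# NLTK's English list (stopwords.words('english')) is much longer.
DEMO_STOP_WORDS = {"a", "an", "the", "is", "on", "of", "and", "in", "to"}


def simple_tokenize(text):
    """Split text into word tokens (a crude stand-in for nltk's word_tokenize)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def remove_stop_words(text, stop_words=DEMO_STOP_WORDS, tokenize=simple_tokenize):
    """Return the tokens of `text` that are not stop words (case-insensitive)."""
    return [tok for tok in tokenize(text) if tok.lower() not in stop_words]


filtered = remove_stop_words("The cat is sleeping on the mat")
print(filtered)
```

Note that this sketch lowercases each token before the lookup. The script in the next section compares tokens as they are, so a capitalized "The" would not match NLTK's lowercase stop-word list.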

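II. Implementation

The code mainly uses the nltk and pandas packages, plus pymysql for the database connection. They can be installed with:

pip install nltk
pip install pandas
pip install pymysql

The stop-word list and the tokenizer models are separate NLTK data packages and must be downloaded once, for example with nltk.download('stopwords') and nltk.download('punkt') (recent NLTK releases ask for 'punkt_tab' instead).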

import pymysql
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Connection settings for the local MySQL database
con = pymysql.connect(
    host='localhost',
    port=3306,
    user='root',
    passwd='123',
    db='nce',
    charset='utf8',
)


def insert(con, frequent, l):
    """Store the filtered tokens in the `frequent` column of the row with a_id = l."""
    cue = con.cursor()
    try:
        print(str(frequent))
        print(l)
        # The token list is converted to one string so that it binds to one column
        cue.execute(
            "update article set frequent=(%s) where a_id=(%s)", [str(frequent), l])
        print("insert success")
    except Exception as e:
        print('Insert error:', e)
        con.rollback()
    else:
        con.commit()
    finally:
        cue.close()


def read():
    """Read every article, remove English stop words, and write the result back."""
    cue = con.cursor()
    query = """select text
    from article
    """
    stop_words = set(stopwords.words('english'))
    cue.execute(query)
    result = cue.fetchall()
    cue.close()
    df_resulet = pd.DataFrame(list(result))
    for l in df_resulet.index:
        # .values is a NumPy array; str() turns it into something like "['...']"
        text = str(df_resulet.loc[l].values)
        # text[1:-1] drops the surrounding square brackets
        word_tokens = word_tokenize(text[1:-1])
        filtered_sentence = [w for w in word_tokens if w not in stop_words]
        # [1:-1] drops the leftover quote tokens; l + 1 assumes that a_id starts
        # at 1 and follows the row order
        insert(con, filtered_sentence[1:-1], l + 1)


read()
con.close()

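III. Errors encountered

1. An array cannot be used directly where a string is expected

df_resulet = pd.DataFrame(list(result))
for l in df_resulet.index:
    text = df_resulet.loc[l].values

Passing this text to word_tokenize raises the following error:

TypeError: cannot use a string pattern on a bytes-like object

The cause is that df_resulet.loc[l].values is a NumPy array rather than a str, which the regular expressions inside word_tokenize cannot process.

Fix: convert the value to a string explicitly.

text = str(df_resulet.loc[l].values)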
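
Converting the whole array with str() works, but it wraps the text in brackets and quotes, which is why the full script slices with [1:-1]. A possible alternative, sketched below under the assumption that each fetched row is a one-element tuple as returned by cursor.fetchall(), is to take the single value out of the row and decode it if it happens to be bytes. The helper name to_text and the sample rows are hypothetical.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding it first if it is bytes-like."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    return str(value)


# Made-up rows shaped like the result of cursor.fetchall() for "select text ...".
rows = [("The cat sat on the mat.",), (b"Stored as bytes.",)]

texts = [to_text(row[0]) for row in rows]
print(texts)

# For comparison: str() on a container keeps the brackets and quotes.
wrapped = str(["The cat sat on the mat."])
print(wrapped)
```

With plain strings in hand, the tokenizer receives the text itself, and no slicing is needed afterwards to strip brackets or quote tokens.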

2. Insert error

The code was as follows (example):

def insert(con,frequent,l):
    cue = con.cursor()
    # print("mysql conneted")
    try:
        # print(frequent)
        cue.execute(
           "insert into article (frequent) values(%s)",[frequent])
        print("insert success")

    except Exception as e:
        print('Insert error:', e)
        con.rollback()
    else:
        con.commit()

Insert error: (1241, 'Operand should contain 1 column(s)')

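The error says that the inserted operand should contain exactly one column; in other words, I was supplying more than one column of data.

Solution:

What I pass in is the token list built in read(), not a string. When pymysql substitutes a list into the statement, it expands it into a parenthesized group of values, one per token, so MySQL sees as many columns as there are tokens and rejects the insert. Converting the list to a single string before passing it solves the problem.

The modified code:

cue.execute(
    "insert into article (frequent) values(%s)", [str(frequent)])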
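
str(frequent) stores the Python repr of the list, which is awkward to read back later. One alternative, shown here only as a sketch, is to serialize the token list to a JSON string and bind it as a single parameter. To keep the example runnable without a MySQL server, it uses the standard library's sqlite3 with an in-memory database as a stand-in (an assumption: sqlite3 uses ? placeholders where pymysql uses %s), and the table mirrors only the a_id and frequent columns from the article.

```python
import json
import sqlite3

# In-memory SQLite database standing in for the MySQL table (assumption).
con = sqlite3.connect(":memory:")
con.execute("create table article (a_id integer primary key, frequent text)")
con.execute("insert into article (a_id) values (1)")

tokens = ["cat", "sat", "mat"]

# Serialize the list into ONE string so that it binds to ONE column.
payload = json.dumps(tokens)
con.execute("update article set frequent = ? where a_id = ?", (payload, 1))
con.commit()

# Reading the column back and parsing it gives the original list again.
stored = con.execute("select frequent from article where a_id = 1").fetchone()[0]
restored = json.loads(stored)
print(restored)
con.close()
```

The same idea carries over to pymysql by swapping the placeholders for %s; the point is that each placeholder receives exactly one scalar value.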

3. SQL update error

    try:
        cue.execute(
            "update article set frequent=(%s) where a_id=(%s)",[str(frequent),l])
        print("insert success")

Lock wait timeout exceeded; try restarting transaction

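Cause:
The UPDATE could not acquire the lock it needed because another transaction was still holding it. Each update waited 50 seconds (InnoDB's default innodb_lock_wait_timeout) and then failed. The fix is simple.

(1) Check the threads currently running on the database: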

SHOW FULL PROCESSLIST

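Look for any thread that has been running for an unusually long time. Then check InnoDB's transaction table INNODB_TRX for transactions that are holding locks, and see whether their thread IDs belong to threads shown as Sleep in the SHOW FULL PROCESSLIST output. If so, the transaction on that sleeping thread was never committed or rolled back and is stuck, so kill it.

(2) Inspect the InnoDB transaction table INNODB_TRX: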

SELECT * FROM information_schema.INNODB_TRX

I had already killed the stuck thread, so nothing is listed here any more. If there is an entry, kill it using the value in its trx_mysql_thread_id column:

kill xxxxxx
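
The manual check above can be sketched as a small helper: given rows from information_schema.INNODB_TRX, pick the transactions that have been open longer than some threshold and build the matching KILL statements. The function name, the 60-second threshold, and the sample rows are assumptions for illustration; trx_mysql_thread_id and trx_started are columns of INNODB_TRX. In practice the rows would come from a cursor that returns dictionaries, and each statement should be reviewed before it is run.

```python
from datetime import datetime, timedelta


def kill_statements(trx_rows, now, max_age_seconds=60):
    """Build KILL statements for transactions open longer than max_age_seconds."""
    limit = timedelta(seconds=max_age_seconds)
    return [
        "KILL {}".format(int(row["trx_mysql_thread_id"]))
        for row in trx_rows
        if now - row["trx_started"] > limit
    ]


# Made-up rows shaped like the result of
# "SELECT trx_mysql_thread_id, trx_started FROM information_schema.INNODB_TRX".
now = datetime(2020, 10, 9, 12, 0, 0)
sample_rows = [
    {"trx_mysql_thread_id": 41, "trx_started": datetime(2020, 10, 9, 11, 30, 0)},
    {"trx_mysql_thread_id": 57, "trx_started": datetime(2020, 10, 9, 11, 59, 50)},
]

statements = kill_statements(sample_rows, now)
print(statements)
```

Only the transaction that has been open for thirty minutes is selected; the one started ten seconds ago is left alone.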

I have not been learning Python for long, so the code is crude and has many shortcomings. Suggestions for improvement are welcome.
