(2-3) Text Preprocessing Algorithms: Stopword Removal

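This article introduces the basic concepts of stopword removal in natural language processing: what stopwords are, and removal strategies based on word lists, word frequency, TF-IDF, and machine learning. It shows how to remove these common but low-information words programmatically during text analysis in order to improve processing efficiency and accuracy.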

2.3  Stopword Removal

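Stopword removal is a common task in natural language processing. It strips out frequent words that carry little semantic content on their own so that text analysis and processing can be more accurate. Stopwords typically include common words such as "a", "an", "the", "in", and "on".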

2.3.1  What Are Stopwords

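Stopwords (stop words) are a class of words in natural language processing that occur frequently in text but are generally considered to carry little semantic or informational value. They usually include common conjunctions, prepositions, articles, pronouns, and some common verbs.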

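These words are treated as stopwords because they appear in almost every text, so they do little to distinguish one text from another and contribute little to text analysis tasks. Removing them reduces noise in the text and makes processing more accurate and effective.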

In practice, some common stopwords include:

  1. Articles: a, an, the
  2. Prepositions: in, on, at, by
  3. Conjunctions: and, or, but
  4. Pronouns: I, you, he, she, it
  5. Auxiliary verbs: is, am, are, have, has, do, does

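The exact stopword list varies with the natural language processing task and with the language. Stopword removal is usually part of text preprocessing, where it cleans the text and reduces interference in later analysis. Once stopwords are removed, text analysis algorithms can focus on words with higher informational value, which improves both the efficiency and the accuracy of processing.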
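To make the point about task-dependent lists concrete, here is a minimal Python sketch. The base list, the two derived lists, and the `filter_tokens` helper are all hypothetical names chosen for illustration; they are not taken from the original article or from any library. The assumption is that a sentiment-style task wants to keep negation and contrast words, while a domain-specific corpus may want to drop a few extra words.

```python
# Hypothetical base list (an assumption for illustration only).
BASE_STOPWORDS = frozenset(
    {"a", "an", "the", "in", "on", "and", "or", "but", "is", "it", "not"}
)

# Sentiment-style task: negation and contrast words carry meaning, so keep them.
SENTIMENT_STOPWORDS = BASE_STOPWORDS - {"not", "but"}

# Domain-specific task: also drop words that are uninformative in that corpus.
FORUM_STOPWORDS = BASE_STOPWORDS | {"thanks", "regards"}


def filter_tokens(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not in `stopwords`."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "the film is not good but it is short".split()
general = filter_tokens(tokens, BASE_STOPWORDS)
sentiment = filter_tokens(tokens, SENTIMENT_STOPWORDS)
print(general)    # ['film', 'good', 'short']
print(sentiment)  # ['film', 'not', 'good', 'but', 'short']
```

With the general list the sentence loses its negation, which would flip its apparent sentiment; the adjusted list keeps "not" and "but". Which words to keep or add is a judgment call for each task.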

2.3.2  List-Based Removal

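The simplest way to remove stopwords is to use a predefined stopword list and drop every word in the text that appears in it. Such lists usually contain common conjunctions, prepositions, articles, and so on. The following is an example of list-based stopword removal.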

Example 2-13: List-based stopword removal (source path: daima/2/
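The listing for this example is not available in this text. As a stand-in, the following is a minimal sketch of the list-based approach using only the Python standard library. It is not the author's code: the `STOPWORDS` set is a small hypothetical list assembled from the categories in section 2.3.1, and `tokenize` and `remove_stopwords` are illustrative helper names. A real project would typically use a much longer list, for example one shipped with an NLP library.

```python
import re

# Hypothetical, deliberately small stopword list built from the categories
# named in section 2.3.1; real lists are much longer.
STOPWORDS = frozenset({
    "a", "an", "the",                                   # articles
    "in", "on", "at", "by",                             # prepositions
    "and", "or", "but",                                 # conjunctions
    "i", "you", "he", "she", "it",                      # pronouns
    "is", "am", "are", "have", "has", "do", "does",     # auxiliary verbs
})


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(text, stopwords=STOPWORDS):
    """Return the tokens of `text` that are not in `stopwords`."""
    return [tok for tok in tokenize(text) if tok not in stopwords]


sample = "The cat is on the mat, and it has a red ball."
filtered = remove_stopwords(sample)
print(filtered)  # ['cat', 'mat', 'red', 'ball']
```

Lowercasing before the lookup means "The" and "the" are treated alike, and storing the list in a set keeps each membership check constant-time regardless of list size.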
