Quick tips for constructing custom stop word lists


by Kavita Ganesan



In natural language processing (NLP) and text mining applications, stop words are used to eliminate unimportant words, allowing applications to focus on the important words instead.

Stop words are a set of commonly used words in any language. For example, in English, “the”, “is”, and “and” would easily qualify as stop words.

While there are various published stop word lists that one can use, in many cases these lists are insufficient because they are not domain-specific. For example, in clinical texts, terms like “mcg”, “dr.”, and “patient” occur in almost every document you come across. So, these terms may be regarded as potential stop words for clinical text mining and retrieval.

Similarly, for tweets, terms like “#”, “RT”, and “@username” can potentially be regarded as stop words. Common language-specific stop word lists generally do not cover such domain-specific terms.

The good news is that it is actually fairly easy to construct your own domain-specific stop word list. Assuming you have a large corpus of text from the domain of interest, you can do one or more of the following to curate your stop word list:

1. Most frequent terms as stop words

Sum the term frequencies of each unique word (w) across all documents in your collection. Sort the terms in descending order of raw term frequency. You can take the top K terms to be your stop words.

You can also eliminate common English words (using a published stop list) prior to sorting so that you target the domain-specific stop words.

Another option is to treat words occurring in more than X% of your documents as stop words. I have found that eliminating words appearing in 85% of documents is effective in several text mining tasks. The benefit of this approach is that it is really easy to implement. The downside, however, is that if you have a particularly long document, the raw term frequency from just a few documents can dominate and push a term to the top. One way to resolve this is to normalize the raw term frequency using a normalizer such as the document length (the number of words in a given document).

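The two variants above (top-K terms by summed raw frequency, and the more-than-X%-of-documents threshold) can be sketched as follows. This is a minimal illustration assuming the corpus is already tokenized and lowercased; the function name and default values are illustrative:

```python
from collections import Counter

def frequency_stop_words(docs, top_k=25, doc_fraction=0.85):
    """Curate stop word candidates from a corpus.

    docs: list of documents, each a list of lowercase tokens.
    Returns (top_k terms by summed raw frequency,
             terms appearing in more than doc_fraction of documents).
    """
    term_freq = Counter()  # summed raw term frequency across the collection
    doc_freq = Counter()   # number of documents each term appears in
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document

    top_terms = [t for t, _ in term_freq.most_common(top_k)]
    threshold = doc_fraction * len(docs)
    common_terms = [t for t, df in doc_freq.items() if df > threshold]
    return top_terms, common_terms
```

To target the domain-specific terms, you can filter each document against a published English stop list before counting, as described above.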
2. Least frequent terms as stop words

Just as extremely frequent terms can be distracting rather than discriminating, extremely infrequent terms may also not be useful for text mining and retrieval. For example, a username like “@username” that occurs only once in a collection of tweets may not be very useful. Other terms like “yoMateZ!”, which may simply be made up, are again unlikely to be useful for text mining applications.

Note: certain terms like “yaaaaayy!!” can often be normalized to standard forms such as “yay”.

However, despite all the normalization, if a term still has a frequency count of one you could remove it. This could significantly reduce your overall feature space.

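This pruning step is easy to sketch. The function below is illustrative and assumes normalization (e.g. “yaaaaayy!!” to “yay”) has already been applied to the tokens:

```python
from collections import Counter

def prune_rare_terms(docs, min_count=2):
    """Drop terms whose collection-wide frequency is below min_count.

    docs: list of documents, each a list of (normalized) tokens.
    Returns the documents with rare terms removed.
    """
    term_freq = Counter(t for tokens in docs for t in tokens)
    return [[t for t in tokens if term_freq[t] >= min_count]
            for tokens in docs]
```

With the default `min_count=2`, this removes exactly the frequency-one terms discussed above and can substantially shrink the feature space.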
3. Low IDF terms as stop words

Inverse document frequency (IDF) refers to the inverse fraction of documents in your collection that contains a specific term (ti). Let us say:

  • you have N documents

  • term ti occurred in M of the N documents.

The IDF of ti is thus computed as:

IDF(ti) = log(N / M)

So, the more documents ti appears in, the lower the IDF score. This means terms that appear in every document will have an IDF score of 0.

If you rank each ti in your collection by its IDF score in descending order, you can treat the bottom K terms with the lowest IDF scores as your stop words.

Again, you can also eliminate common English words using a published stop list prior to sorting so that you target the domain-specific low IDF words. This is not necessary if your K is large enough that it will prune both general stop words as well as domain-specific stop words. You will find more information about IDFs here.

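The IDF ranking above can be sketched as follows, using IDF(ti) = log(N / M) as defined earlier; the function name and the `bottom_k` default are illustrative:

```python
import math
from collections import Counter

def low_idf_stop_words(docs, bottom_k=10):
    """Return the bottom_k terms with the lowest IDF scores.

    docs: list of documents, each a list of tokens.
    IDF(t) = log(N / M), where N is the number of documents
    and M is the number of documents containing t.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {t: math.log(n_docs / m) for t, m in doc_freq.items()}
    return sorted(idf, key=idf.get)[:bottom_k]
```

A term that appears in every document gets log(N / N) = 0 and therefore sorts to the very bottom, exactly as described above.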
So, would stop words help my task?

How would you know if removing domain-specific stop words would be helpful in your case? Easy: test it on a subset of your data. See whether your measure of accuracy or performance improves, stays constant, or degrades. If it degrades, needless to say, don’t do it, unless the degradation is negligible and you see gains in other forms, such as a smaller model or the ability to process everything in memory.

Translated from: https://www.freecodecamp.org/news/quick-tips-for-constructing-custom-stop-word-lists-c22b40a25169/
