Create a Text Summary Using Python Without NLP Libraries

This article shows how to summarize a text without using any NLP library, relying only on pre-processing, counting, and sorting. The main steps are splitting the text into sentences, tokenizing the words, excluding stopwords, assigning values to words, scoring the sentences, and creating a weighted histogram. The highest-scoring sentences are then selected as the summary, condensing a 600-word text into 65 words.

There are several NLP libraries to work with, for example, Natural Language Toolkit (NLTK), TextBlob, CoreNLP, Gensim, and spaCy. You can use these incredible libraries for processing your text.

There are a lot of methods for summarizing texts. In this article, I’m going to show you the easiest way to summarize your texts into three sentences without using any of the NLP libraries.

However, we will need some libraries for pre-processing and sorting the data.

Libraries Required

import re
import heapq

Suppose we are going to summarize the following text block, which has 600 words:

Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due to both the steady increase in computational power (see Moore’s law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data. Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques[4][5] can achieve state-of-the-art results in many natural language tasks, for example in language modeling,[6] parsing,[7][8] and many others. 
Popular techniques include the use of word embeddings to capture semantic properties of words, and an increase in end-to-end learning of a higher-level task (e.g., question answering) instead of relying on a pipeline of separate intermediate tasks (e.g., part-of-speech tagging and dependency parsing). In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT).

Let’s load the text into a string.

text = """Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due to both the steady increase in computational power (see Moore’s law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data. Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques[4][5] can achieve state-of-the-art results in many natural language tasks, for example in language modeling,[6] parsing,[7][8] and many others.
Popular techniques include the use of word embeddings to capture semantic properties of words, and an increase in end-to-end learning of a higher-level task (e.g., question answering) instead of relying on a pipeline of separate intermediate tasks (e.g., part-of-speech tagging and dependency parsing). In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT)."""

Now, split the text string into sentences.

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
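To see what this pattern does, here is a toy example on a made-up two-sentence string (not part of the article's text); note the trailing empty string, which the sentence-length filter below will discard:

sample = "Rules were written by hand. Later, statistical models took over!"
print(re.split(r' *[\.\?!][\'"\)\]]* *', sample))
# ['Rules were written by hand', 'Later, statistical models took over', '']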

For preprocessing, we are going to lowercase the text and split it into words (word_tokenize).

clean_text = text.lower()
word_tokenize = clean_text.split()

We also need to exclude the stopwords of the language of the text we want to summarize. You can get the stopwords of your desired language from the Countwordsfree website: https://countwordsfree.com/stopwords

stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

We have put all the stopwords of the English language in a list. You can take another language's stopwords and append them to this list, as shown below.
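For example, to work with French text you could append a few common French stopwords (only a short illustration; full lists are available from the site above):

stop_words = stop_words + ["le", "la", "les", "un", "une", "de", "et"]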

Next, we are going to count all the words in a dictionary, giving each word its frequency as its value.

word2count = {}
for word in word_tokenize:
    if word not in stop_words:
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1
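On a tiny made-up input, the dictionary ends up holding raw word frequencies with stopwords skipped (a toy example, not the article's text):

toy_tokens = "language models process language data".split()
toy_counts = {}
for word in toy_tokens:
    if word not in stop_words:
        toy_counts[word] = toy_counts.get(word, 0) + 1
print(toy_counts)  # {'language': 2, 'models': 1, 'process': 1, 'data': 1}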

Now, with the values from word2count, we are going to score the sentences.

sent2score = {}
for sentence in sentences:
    for word in  sentence.split():
        if word in word2count.keys():
            if len(sentence.split(' ')) < 28 and len(sentence.split(' ')) > 9:
                if sentence not in sent2score.keys():
                    sent2score[sentence] = word2count[word]
                else:
                    sent2score[sentence] += word2count[word]

Here we have only taken sentences with more than nine and fewer than 28 words for summarization.

After that, we are going to create a weighted histogram by dividing each word's count by the highest count.

max_count = max(word2count.values())
for key in word2count.keys():
    word2count[key] = word2count[key] / max_count
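Because sent2score above was built from the raw counts, normalizing word2count afterwards does not change the ranking. If you want the weighted histogram to drive the scores, one option (a sketch of a reordered flow, not the article's exact sequence) is to normalize first and then score:

# normalize the counts first
max_count = max(word2count.values())
for key in word2count.keys():
    word2count[key] = word2count[key] / max_count

# then score the sentences with the weighted values
sent2score = {}
for sentence in sentences:
    words_in_sentence = sentence.split(' ')
    if 9 < len(words_in_sentence) < 28:
        for word in words_in_sentence:
            if word in word2count:
                sent2score[sentence] = sent2score.get(sentence, 0) + word2count[word]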

Now we just have to pick the best three sentences and see the results.

best_three_sentences = heapq.nlargest(3, sent2score, key=sent2score.get)
print(*best_three_sentences)
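heapq.nlargest returns the dictionary keys with the highest scores; here is how the same call behaves on a small made-up dictionary:

toy_scores = {"sentence a": 4, "sentence b": 9, "sentence c": 7, "sentence d": 2}
print(heapq.nlargest(3, toy_scores, key=toy_scores.get))
# ['sentence b', 'sentence c', 'sentence a']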

Output

“Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing The cache language models upon which many speech recognition systems now rely are examples of such statistical models [3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.”

We have summarized a text of 600 words into 65 words.

Here Is the Whole Code

text = "Text you want to summarize"
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
clean_text = text.lower()
word_tokenize = clean_text.split()
#english stopwords
stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
word2count = {}
for word in word_tokenize:
    if word not in stop_words:
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1
sent2score = {}
for sentence in sentences:
    for word in  sentence.split():
        if word in word2count.keys():
            if len(sentence.split(' ')) < 28 and len(sentence.split(' ')) > 9:
                if sentence not in sent2score.keys():
                    sent2score[sentence] = word2count[word]
                else:
                    sent2score[sentence] += word2count[word]
# weighted histogram
max_count = max(word2count.values())
for key in word2count.keys():
    word2count[key] = word2count[key] / max_count
    
best_three_sentences = heapq.nlargest(3, sent2score, key=sent2score.get)
print(*best_three_sentences)
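If you want to reuse these steps, here is a minimal sketch that wraps them in a hypothetical summarize() function, with the normalization applied before scoring and the stopword list passed in as an argument:

def summarize(text, stop_words, num_sentences=3):
    # split the text into sentences
    sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
    # count non-stopword word frequencies
    word2count = {}
    for word in text.lower().split():
        if word not in stop_words:
            word2count[word] = word2count.get(word, 0) + 1
    # weighted histogram: normalize counts by the maximum count
    max_count = max(word2count.values())
    for key in word2count.keys():
        word2count[key] = word2count[key] / max_count
    # score sentences of a reasonable length
    sent2score = {}
    for sentence in sentences:
        words = sentence.split(' ')
        if 9 < len(words) < 28:
            for word in words:
                if word in word2count:
                    sent2score[sentence] = sent2score.get(sentence, 0) + word2count[word]
    # return the top-scoring sentences
    return heapq.nlargest(num_sentences, sent2score, key=sent2score.get)

print(*summarize(text, stop_words))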

Conclusion

There are many more methods of text summarization; I have demonstrated a really simple one. I hope this helps you and encourages you to learn more about natural language processing and to build more interesting projects.

Translated from: https://medium.com/better-programming/create-text-summary-using-python-without-nlp-libraries-3b0f94af585f
